Github user falaki commented on a diff in the pull request:
https://github.com/apache/spark/pull/1025#discussion_r14542796
--- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---
@@ -45,11 +50,74 @@ private[spark] object SamplingUtils {
val fraction = sampleSizeLowerBound.toDouble / total
if (withReplacement) {
val numStDev = if (sampleSizeLowerBound < 12) 9 else 5
- fraction + numStDev * math.sqrt(fraction / total)
+ math.max(1e-10, fraction + numStDev * math.sqrt(fraction / total))
} else {
val delta = 1e-4
val gamma = - math.log(delta) / total
- math.min(1, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction))
+ math.min(1,
+ math.max(1e-10, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
+ }
+ }
+}
+
+/**
+ * Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact
+ * sample sizes with high confidence when sampling with replacement.
+ *
+ * The algorithm for guaranteeing sample size instantly accepts items whose associated value drawn
+ * from Pois(s) is less than the lower bound and puts items whose value is between the lower and
+ * upper bound in a waitlist. The final sample is consisted of all items accepted on the fly and a
+ * portion of the waitlist needed to make the exact sample size.
+ */
+private[spark] object PoissonBounds {
+
+ val delta = 1e-4 / 3.0
+
+ /**
+ * Compute the threshold for accepting items on the fly. The threshold value is a fairly small
+ * number, which means if the item has an associated value < threshold, it is highly likely to
+ * be in the final sample. Hence we accept items with values less than the returned value of this
+ * function instantly.
+ *
+ * @param s sample size
+ * @return threshold for accepting items on the fly
+ */
+ def getLowerBound(s: Double): Double = {
--- End diff --
Let's combine this function with getUpperBound() into a single
getPoissonBounds() function that returns a tuple. There is good overlap between
the two functions and they are used in the same place.
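
A rough sketch of the shape this could take, shown only to illustrate the tuple-returning signature. The `PoissonBoundsSketch` object, the `spread` helper, and its constant are hypothetical placeholders, not the PR's actual bound computation (the bodies of getLowerBound() and getUpperBound() are not visible in this diff):

```scala
object PoissonBoundsSketch {

  // Hypothetical stand-in for whatever work the two bound functions share.
  // The constant 6.0 is an assumption made for this example only.
  private def spread(s: Double): Double = 6.0 * math.sqrt(s)

  /**
   * Returns (lowerBound, upperBound) for sample size s, computing the
   * shared part once instead of once per bound.
   */
  def getPoissonBounds(s: Double): (Double, Double) = {
    val d = spread(s)
    (math.max(s - d, 0.0), s + d)
  }

  def main(args: Array[String]): Unit = {
    // The call site destructures the tuple in one step.
    val (lower, upper) = getPoissonBounds(100.0)
    println(s"accept below $lower, waitlist up to $upper")
  }
}
```

Since both values are used in the same place, the caller gets them from one call and the overlapping logic lives in one function.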