Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1025#discussion_r14694253
  
    --- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---
    @@ -45,11 +50,78 @@ private[spark] object SamplingUtils {
         val fraction = sampleSizeLowerBound.toDouble / total
         if (withReplacement) {
           val numStDev = if (sampleSizeLowerBound < 12) 9 else 5
    -      fraction + numStDev * math.sqrt(fraction / total)
    +      math.max(1e-10, fraction + numStDev * math.sqrt(fraction / total))
         } else {
           val delta = 1e-4
           val gamma = - math.log(delta) / total
     -      math.min(1, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction))
    +      math.min(1,
     +        math.max(1e-10, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
    +    }
    +  }
    +}
    +
    +/**
     + * Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact
     + * sample sizes with high confidence when sampling with replacement.
    + *
     + * The algorithm for guaranteeing sample size instantly accepts items whose associated value drawn
    --- End diff --
    
    I think the documentation is wrong. The bounds are not used as thresholds in sampling with replacement. The lower bound q1 guarantees that, when sampling according to Pois(q1/n), the sample size is smaller than s with high probability, while the upper bound q2 guarantees that, when sampling according to Pois(q2/n), the sample size is greater than s with high probability.
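
    For context, the adjusted-fraction computation shown in the diff above can be sketched as a standalone function. This is an illustration of the same arithmetic, not the actual Spark internals; the object name `FractionSketch` is made up here:

    ```scala
    object FractionSketch {
      /**
       * Sketch of computing an over-sampling fraction so that a sample of at
       * least `sampleSizeLowerBound` items is obtained with high confidence.
       * Mirrors the expressions in the diff above.
       */
      def computeFraction(
          sampleSizeLowerBound: Int,
          total: Long,
          withReplacement: Boolean): Double = {
        val fraction = sampleSizeLowerBound.toDouble / total
        if (withReplacement) {
          // Poisson sampling: pad the fraction by several standard deviations
          // (more padding for tiny sample sizes), floored at a small positive value.
          val numStDev = if (sampleSizeLowerBound < 12) 9 else 5
          math.max(1e-10, fraction + numStDev * math.sqrt(fraction / total))
        } else {
          // Bernoulli sampling: inflate the fraction using a concentration
          // bound with failure probability delta, capped at 1.
          val delta = 1e-4
          val gamma = -math.log(delta) / total
          math.min(1,
            math.max(1e-10, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
        }
      }

      def main(args: Array[String]): Unit = {
        // The adjusted fraction is slightly above the naive ratio 1000 / 1e6 = 0.001.
        println(computeFraction(1000, 1000000L, withReplacement = false))
        println(computeFraction(1000, 1000000L, withReplacement = true))
      }
    }
    ```

    In both branches the result is strictly larger than the naive `sampleSizeLowerBound / total`, which is the point: over-sample first, then trim down to the exact size.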

