GitHub user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16013#discussion_r89675866
  
    --- Diff: core/src/main/scala/org/apache/spark/util/random/StratifiedSamplingUtils.scala ---
    @@ -35,13 +35,14 @@ import org.apache.spark.rdd.RDD
      * high probability. This is achieved by maintaining a waitlist of size O(log(s)), where s is the
      * desired sample size for each stratum.
      *
    - * Like in simple random sampling, we generate a random value for each item from the
    - * uniform  distribution [0.0, 1.0]. All items with values <= min(values of items in the waitlist)
    - * are accepted into the sample instantly. The threshold for instant accept is designed so that
    - * s - numAccepted = O(sqrt(s)), where s is again the desired sample size. Thus, by maintaining a
    - * waitlist size = O(sqrt(s)), we will be able to create a sample of the exact size s by adding
    - * a portion of the waitlist to the set of items that are instantly accepted. The exact threshold
    - * is computed by sorting the values in the waitlist and picking the value at (s - numAccepted).
    + * Like in simple random sampling, we generate a random value for each item from the uniform
    + * distribution [0.0, 1.0]. All items with values less than or equal to min(values of items in the
    + * waitlist) are accepted into the sample instantly. The threshold for instant accept is designed
    + * so that s - numAccepted = O(sqrt(s)), where s is again the desired sample size. Thus, by
    + * maintaining a waitlist size = O(sqrt(s)), we will be able to create a sample of the exact size
    + * s by adding a portion of the waitlist to the set of items that are instantly accepted. The exact
    + * threshold is computed by sorting the values in the waitlist and picking the value at
    + * (s - numAccepted).
    --- End diff --
    
    Here, the only change is replacing `<=` with `less than or equal to`.
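
For readers unfamiliar with the scheme the doc comment describes, the accept/waitlist idea might be sketched roughly as below. This is only an illustration, not the code in `StratifiedSamplingUtils`: `sampleExact`, `lower`, and `upper` are hypothetical names, the thresholds are passed in by hand rather than derived from the bounds Spark actually uses, and the full-sort fallback is an assumption added so the sketch always returns exactly `s` items.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not Spark's implementation.
// Draws exactly `s` items by giving every item a uniform random key.
// `lower` is an assumed instant-accept threshold and `upper` an assumed
// waitlist threshold; how Spark derives its own bounds is not shown here.
def sampleExact[T](items: Seq[T], s: Int, lower: Double, upper: Double,
                   rng: Random): Seq[T] = {
  require(s >= 0 && s <= items.size, "s must be in [0, items.size]")
  require(lower <= upper, "lower must not exceed upper")

  // One uniform key in [0.0, 1.0) per item.
  val keyed = items.map(item => (rng.nextDouble(), item))

  // Items with keys at or below `lower` are accepted instantly.
  val accepted = keyed.filter(_._1 <= lower)

  // Items with keys in (lower, upper] wait, sorted by key.
  val waitlist = keyed.filter { case (k, _) => k > lower && k <= upper }
    .sortBy(_._1)

  val needed = s - accepted.size
  if (needed < 0 || needed > waitlist.size) {
    // The assumed thresholds missed; fall back to sorting every key.
    keyed.sortBy(_._1).take(s).map(_._2)
  } else {
    // Top up the instant accepts with the smallest waitlisted keys.
    (accepted ++ waitlist.take(needed)).map(_._2)
  }
}

// Usage: thresholds chosen by hand around s / n = 0.1.
val sample = sampleExact((1 to 1000).toSeq, 100, 0.08, 0.12, new Random(42))
println(sample.size)
```

Either branch returns the `s` items with the smallest keys, so the result is a uniform sample without replacement; the waitlist only saves sorting every key when the thresholds are well chosen.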

