[GitHub] spark pull request: [SPARK-2937] Separate out samplyByKeyExact as ...

mengxr Sun, 10 Aug 2014 10:07:34 -0700

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1866#discussion_r16032165
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -133,68 +133,64 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
        * Return a subset of this RDD sampled by key (via stratified sampling).
        *
        * Create a sample of this RDD using variable sampling rates for different keys as specified by
    -   * `fractions`, a key to sampling rate map.
    -   *
    -   * If `exact` is set to false, create the sample via simple random sampling, with one pass
    -   * over the RDD, to produce a sample of size that's approximately equal to the sum of
    -   * math.ceil(numItems * samplingRate) over all key values; otherwise, use additional passes over
    -   * the RDD to create a sample size that's exactly equal to the sum of
    +   * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
    +   * RDD, to produce a sample of size that's approximately equal to the sum of
        * math.ceil(numItems * samplingRate) over all key values.
        */
       def sampleByKey(withReplacement: Boolean,
           fractions: JMap[K, Double],
    -      exact: Boolean,
           seed: Long): JavaPairRDD[K, V] =
    -    new JavaPairRDD[K, V](rdd.sampleByKey(withReplacement, fractions, exact, seed))
    +    new JavaPairRDD[K, V](rdd.sampleByKey(withReplacement, fractions, seed))
     
       /**
        * Return a subset of this RDD sampled by key (via stratified sampling).
        *
        * Create a sample of this RDD using variable sampling rates for different keys as specified by
    -   * `fractions`, a key to sampling rate map.
    -   *
    -   * If `exact` is set to false, create the sample via simple random sampling, with one pass
    -   * over the RDD, to produce a sample of size that's approximately equal to the sum of
    -   * math.ceil(numItems * samplingRate) over all key values; otherwise, use additional passes over
    -   * the RDD to create a sample size that's exactly equal to the sum of
    +   * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
    +   * RDD, to produce a sample of size that's approximately equal to the sum of
        * math.ceil(numItems * samplingRate) over all key values.
        *
    -   * Use Utils.random.nextLong as the default seed for the random number generator
    +   * Use Utils.random.nextLong as the default seed for the random number generator.
        */
       def sampleByKey(withReplacement: Boolean,
    -      fractions: JMap[K, Double],
    -      exact: Boolean): JavaPairRDD[K, V] =
    -    sampleByKey(withReplacement, fractions, exact, Utils.random.nextLong)
    +      fractions: JMap[K, Double]): JavaPairRDD[K, V] =
    +    sampleByKey(withReplacement, fractions, Utils.random.nextLong)
     
       /**
    -   * Return a subset of this RDD sampled by key (via stratified sampling).
    +   * ::Experimental::
        *
    -   * Create a sample of this RDD using variable sampling rates for different keys as specified by
    -   * `fractions`, a key to sampling rate map.
    +   * Return a subset of this RDD sampled by key (via stratified sampling) containing exactly
    +   * math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
        *
    -   * Produce a sample of size that's approximately equal to the sum of
    -   * math.ceil(numItems * samplingRate) over all key values with one pass over the RDD via
    -   * simple random sampling.
    +   * This method differs from [[sampleByKey]] in that we make additional passes over the RDD to
    +   * create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate)
    +   * over all key values with a 99.99% confidence. When sampling without replacement, we need one
    +   * additional pass over the RDD to guarantee sample size; when sampling with replacement, we need
    +   * two additional passes.
        */
    -  def sampleByKey(withReplacement: Boolean,
    +  @Experimental
    +  def sampleByKeyExact(withReplacement: Boolean,
           fractions: JMap[K, Double],
           seed: Long): JavaPairRDD[K, V] =
    -    sampleByKey(withReplacement, fractions, false, seed)
    +    new JavaPairRDD[K, V](rdd.sampleByKeyExact(withReplacement, fractions, seed))
     
       /**
    -   * Return a subset of this RDD sampled by key (via stratified sampling).
    +   * ::Experimental::
        *
    --- End diff --
    
    ditto: remove this line
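
    For readers of the doc comment in the diff, here is a minimal sketch of
    the per-stratum sample size it describes, math.ceil(numItems *
    samplingRate) for each key. It is plain Scala on an in-memory Seq and is
    not Spark's implementation: the object name StratifiedSampleSketch, the
    helpers targetSizes and sampleExact, and the shuffle-and-take approach
    are assumptions made for illustration only, and only the
    without-replacement case is covered.

```scala
import scala.util.Random

// Hypothetical illustration only; not part of Spark or of this pull request.
object StratifiedSampleSketch {

  // Target size per stratum: ceil(numItems * samplingRate), as in the doc comment.
  // Keys missing from `fractions` are treated as rate 0.0 (an assumption).
  def targetSizes[K, V](data: Seq[(K, V)], fractions: Map[K, Double]): Map[K, Int] =
    data.groupBy(_._1).map { case (k, items) =>
      k -> math.ceil(items.size * fractions.getOrElse(k, 0.0)).toInt
    }

  // Exact-size sample without replacement: shuffle each stratum with a
  // seeded generator and keep the first ceil(numItems * samplingRate) items.
  def sampleExact[K, V](data: Seq[(K, V)],
                        fractions: Map[K, Double],
                        seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.groupBy(_._1).toSeq.flatMap { case (k, items) =>
      val n = math.ceil(items.size * fractions.getOrElse(k, 0.0)).toInt
      rng.shuffle(items).take(n)
    }
  }
}
```

    With 10 pairs under key "a" at rate 0.25 and 4 pairs under key "b" at
    rate 0.5, the sketch keeps 3 and 2 pairs respectively, so the total is
    the sum of the per-key ceilings.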



