Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/1866#discussion_r16032165
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
@@ -133,68 +133,64 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
* Return a subset of this RDD sampled by key (via stratified sampling).
*
* Create a sample of this RDD using variable sampling rates for different keys as specified by
- * `fractions`, a key to sampling rate map.
- *
- * If `exact` is set to false, create the sample via simple random sampling, with one pass
- * over the RDD, to produce a sample of size that's approximately equal to the sum of
- * math.ceil(numItems * samplingRate) over all key values; otherwise, use additional passes over
- * the RDD to create a sample size that's exactly equal to the sum of
+ * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
+ * RDD, to produce a sample of size that's approximately equal to the sum of
* math.ceil(numItems * samplingRate) over all key values.
*/
def sampleByKey(withReplacement: Boolean,
fractions: JMap[K, Double],
- exact: Boolean,
seed: Long): JavaPairRDD[K, V] =
- new JavaPairRDD[K, V](rdd.sampleByKey(withReplacement, fractions, exact, seed))
+ new JavaPairRDD[K, V](rdd.sampleByKey(withReplacement, fractions, seed))
/**
* Return a subset of this RDD sampled by key (via stratified sampling).
*
* Create a sample of this RDD using variable sampling rates for different keys as specified by
- * `fractions`, a key to sampling rate map.
- *
- * If `exact` is set to false, create the sample via simple random sampling, with one pass
- * over the RDD, to produce a sample of size that's approximately equal to the sum of
- * math.ceil(numItems * samplingRate) over all key values; otherwise, use additional passes over
- * the RDD to create a sample size that's exactly equal to the sum of
+ * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
+ * RDD, to produce a sample of size that's approximately equal to the sum of
* math.ceil(numItems * samplingRate) over all key values.
*
- * Use Utils.random.nextLong as the default seed for the random number generator
+ * Use Utils.random.nextLong as the default seed for the random number generator.
*/
def sampleByKey(withReplacement: Boolean,
- fractions: JMap[K, Double],
- exact: Boolean): JavaPairRDD[K, V] =
- sampleByKey(withReplacement, fractions, exact, Utils.random.nextLong)
+ fractions: JMap[K, Double]): JavaPairRDD[K, V] =
+ sampleByKey(withReplacement, fractions, Utils.random.nextLong)
/**
- * Return a subset of this RDD sampled by key (via stratified sampling).
+ * ::Experimental::
*
- * Create a sample of this RDD using variable sampling rates for different keys as specified by
- * `fractions`, a key to sampling rate map.
+ * Return a subset of this RDD sampled by key (via stratified sampling) containing exactly
+ * math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
*
- * Produce a sample of size that's approximately equal to the sum of
- * math.ceil(numItems * samplingRate) over all key values with one pass over the RDD via
- * simple random sampling.
+ * This method differs from [[sampleByKey]] in that we make additional passes over the RDD to
+ * create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate)
+ * over all key values with a 99.99% confidence. When sampling without replacement, we need one
+ * additional pass over the RDD to guarantee sample size; when sampling with replacement, we need
+ * two additional passes.
*/
- def sampleByKey(withReplacement: Boolean,
+ @Experimental
+ def sampleByKeyExact(withReplacement: Boolean,
fractions: JMap[K, Double],
seed: Long): JavaPairRDD[K, V] =
- sampleByKey(withReplacement, fractions, false, seed)
+ new JavaPairRDD[K, V](rdd.sampleByKeyExact(withReplacement, fractions, seed))
/**
- * Return a subset of this RDD sampled by key (via stratified sampling).
+ * ::Experimental::
*
--- End diff --
ditto: remove this line
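
For reference, the target size described in the sampleByKeyExact doc comment (the sum of math.ceil(numItems * samplingRate) over all keys) can be sketched without Spark. This is only an illustration of the arithmetic, not the code in this PR: the class name, the method name, and the choice to treat a key with no entry in `fractions` as rate 0.0 are all assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class StratifiedSampleSize {

    // Sum of ceil(numItems * samplingRate) over all keys (strata).
    // countsByKey: number of pairs per key; fractions: sampling rate per key.
    static long exactSampleSize(Map<String, Long> countsByKey,
                                Map<String, Double> fractions) {
        long total = 0;
        for (Map.Entry<String, Long> e : countsByKey.entrySet()) {
            // Assumption of this sketch only: a key without a rate contributes nothing.
            double fraction = fractions.getOrDefault(e.getKey(), 0.0);
            total += (long) Math.ceil(e.getValue() * fraction);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("a", 10L); // ceil(10 * 0.5)  = 5
        counts.put("b", 7L);  // ceil(7 * 0.25) = 2
        Map<String, Double> fractions = new HashMap<>();
        fractions.put("a", 0.5);
        fractions.put("b", 0.25);
        System.out.println(exactSampleSize(counts, fractions)); // prints 7
    }
}
```

Per the doc comments in the diff, sampleByKey only approximates this size in one pass, while sampleByKeyExact makes additional passes to reach it.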