Github user sddyljsx commented on a diff in the pull request:
https://github.com/apache/spark/pull/21859#discussion_r211230520
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -166,9 +169,20 @@ class RangePartitioner[K : Ordering : ClassTag, V](
// Assume the input partitions are roughly balanced and over-sample
a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize /
rdd.partitions.length).toInt
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1),
sampleSizePerPartition)
+ val numSampled = sketched.map(_._3.length).sum
if (numItems == 0L) {
Array.empty
} else {
+ // already got the whole data
+ if (sampleCacheEnabled && numItems == numSampled) {
+ // get the sampled data
+ sampledArray = new Array[K](numSampled)
+ var curPos = 0
+ sketched.foreach(_._3.foreach(sampleRow => {
--- End diff --
We need the sample data to form a rdd for the next step.
I can't think of a better idea than merging the sample data arrays from
each partition to one array and parallelizing it to a rdd:
```
sparkContext.parallelize(partitioner.getSampledArray.toSeq,
rdd.getNumPartitions)
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]