Github user sddyljsx commented on a diff in the pull request:
https://github.com/apache/spark/pull/21859#discussion_r209417115
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -166,9 +169,17 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       // Assume the input partitions are roughly balanced and over-sample a little bit.
       val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
       val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
+      val numSampled = sketched.map(_._3.length).sum
       if (numItems == 0L) {
         Array.empty
       } else {
+        // already got the whole data
+        if (sampleCacheEnabled && numItems == numSampled) {
+          // get the sampled data
+          sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => {
--- End diff ---
As @kiszk suggests in his review:

Do we need to always create sampledArray and store it in a var? It may lead to overhead when execution goes to L182. It would be good to calculate only the length here and create the array at L179.

Maybe allocating it only when necessary is the better choice.
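
A minimal sketch of that allocate-when-needed idea (the helper name concatSamplesLazily, its Option return, and the standalone-method shape are illustrative, not part of the patch; the (idx, count, sample) tuples mirror what RangePartitioner.sketch returns, and sampleCacheEnabled mirrors the flag introduced by this PR):

import scala.reflect.ClassTag

// Hypothetical helper: compute the sampled count eagerly (cheap), but
// concatenate the per-partition samples into one array lazily, so the
// copy cost is paid only on the branch that actually reads the array.
def concatSamplesLazily[K: ClassTag](
    sketched: Seq[(Int, Long, Array[K])],  // (partitionId, itemCount, sample)
    numItems: Long,
    sampleCacheEnabled: Boolean): Option[Array[K]] = {
  // Length only: no array is built yet.
  val numSampled = sketched.map(_._3.length).sum
  // lazy val defers the allocation and copies until first use.
  lazy val sampledArray: Array[K] = {
    val out = new Array[K](numSampled)
    var offset = 0
    for ((_, _, sample) <- sketched) {
      System.arraycopy(sample, 0, out, offset, sample.length)
      offset += sample.length
    }
    out
  }
  if (sampleCacheEnabled && numItems == numSampled) {
    // The sketch captured every record, so the samples are the whole
    // data set; sampledArray is materialized here, on first use.
    Some(sampledArray)
  } else {
    // Re-sampling path (L182 in the patch): sampledArray is never
    // touched, so the foldLeft/copy overhead is avoided entirely.
    None
  }
}

With the lazy val, the concatenation runs only when execution reaches the branch that reads the array, which is exactly the overhead @kiszk was concerned about on the L182 path.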