Github user sddyljsx commented on a diff in the pull request: https://github.com/apache/spark/pull/21859#discussion_r209417115 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -166,9 +169,17 @@ class RangePartitioner[K : Ordering : ClassTag, V]( // Assume the input partitions are roughly balanced and over-sample a little bit. val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition) + val numSampled = sketched.map(_._3.length).sum if (numItems == 0L) { Array.empty } else { + // already got the whole data + if (sampleCacheEnabled && numItems == numSampled) { + // get the sampled data + sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => { --- End diff -- as @kiszk suggests in his review: Do we need to always create sampledArray and to store into var? It may lead to overhead when the execution would go to L182. It would be good to calculate only length here and to create the array at L179. maybe allocate it when necessary is a better choice
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org