[GitHub] spark pull request #21859: [SPARK-24900][SQL]Speed up sort when the dataset ...

kiszk Mon, 06 Aug 2018 02:03:56 -0700

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21859#discussion_r207819972
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -166,7 +170,13 @@ class RangePartitioner[K : Ordering : ClassTag, V](
           // Assume the input partitions are roughly balanced and over-sample a little bit.
           val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
           val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    -      if (numItems == 0L) {
    +      // get the sampled data
    +      sampledArray = sketched.foldLeft(sampledArray)((total, sample) => {
    --- End diff --
    
    Do we always need to create `sampledArray` and store it in a `var`? Doing so 
    may add unnecessary overhead when execution takes the path at L182.
    It would be better to compute only the length here and create the array at 
    L179.
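
A minimal sketch of the suggestion above, not the PR's actual code. It assumes the concatenated sample is only needed when the sample happens to cover every item, which is a guess at the PR's intent. The object `SampleLengthSketch`, the `Sketched` alias (with `Int` keys standing in for `K`), and all three method names are hypothetical.

```scala
object SampleLengthSketch {
  // Hypothetical stand-in for the per-partition result of RangePartitioner.sketch:
  // (partition index, number of items in the partition, sampled keys).
  type Sketched = Array[(Int, Long, Array[Int])]

  // Total number of sampled keys, computed without concatenating the samples.
  def sampledLength(sketched: Sketched): Int =
    sketched.iterator.map(_._3.length).sum

  // Concatenate all samples into one array. Called only on the branch that
  // actually needs the data, so the other branch pays no allocation cost.
  def collectSamples(sketched: Sketched): Array[Int] =
    sketched.iterator.flatMap(_._3.iterator).toArray

  // Assumed condition: materialize the array only when the sample covers
  // every item; otherwise return None and allocate nothing.
  def sortedSampleIfComplete(numItems: Long, sketched: Sketched): Option[Array[Int]] =
    if (numItems > 0L && sampledLength(sketched).toLong == numItems) {
      Some(collectSamples(sketched).sorted)
    } else {
      None
    }
}
```

Computing the length is a single pass over the per-partition arrays with no copying, so the path that does not need the data stays as cheap as before.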


