Github user sddyljsx commented on a diff in the pull request:
https://github.com/apache/spark/pull/21859#discussion_r209417115
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -166,9 +169,17 @@ class RangePartitioner[K : Ordering : ClassTag, V](
       // Assume the input partitions are roughly balanced and over-sample a little bit.
       val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
       val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
+      val numSampled = sketched.map(_._3.length).sum
       if (numItems == 0L) {
         Array.empty
       } else {
+        // already got the whole data
+        if (sampleCacheEnabled && numItems == numSampled) {
+          // get the sampled data
+          sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => {
--- End diff ---
As @kiszk suggests in his review:

Do we need to always create sampledArray and store it in a var? It may lead to overhead when execution goes to L182. It would be good to calculate only the length here and create the array at L179.

Maybe allocating it only when necessary is the better choice.
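
A minimal sketch of that allocate-when-needed idea (the helper name concatSamplesLazily, its Option return, and the standalone-method shape are illustrative, not part of the patch; the (idx, count, sample) tuples mirror what RangePartitioner.sketch returns, and sampleCacheEnabled mirrors the flag introduced by this PR):

import scala.reflect.ClassTag

// Hypothetical helper: compute the sampled count eagerly (cheap), but
// concatenate the per-partition samples into one array lazily, so the
// copy cost is paid only on the branch that actually reads the array.
def concatSamplesLazily[K: ClassTag](
    sketched: Seq[(Int, Long, Array[K])],  // (partitionId, itemCount, sample)
    numItems: Long,
    sampleCacheEnabled: Boolean): Option[Array[K]] = {
  // Length only: no array is built yet.
  val numSampled = sketched.map(_._3.length).sum
  // lazy val defers the allocation and copies until first use.
  lazy val sampledArray: Array[K] = {
    val out = new Array[K](numSampled)
    var offset = 0
    for ((_, _, sample) <- sketched) {
      System.arraycopy(sample, 0, out, offset, sample.length)
      offset += sample.length
    }
    out
  }
  if (sampleCacheEnabled && numItems == numSampled) {
    // The sketch captured every record, so the samples are the whole
    // data set; sampledArray is materialized here, on first use.
    Some(sampledArray)
  } else {
    // Re-sampling path (L182 in the patch): sampledArray is never
    // touched, so the foldLeft/copy overhead is avoided entirely.
    None
  }
}

With the lazy val, the concatenation runs only when execution reaches the branch that reads the array, which is exactly the overhead @kiszk was concerned about on the L182 path.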