[GitHub] spark pull request #21859: [SPARK-24900][SQL]Speed up sort when the dataset ...

sddyljsx Mon, 20 Aug 2018 04:37:28 -0700

Github user sddyljsx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21859#discussion_r211230520
  
    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -166,9 +169,20 @@ class RangePartitioner[K : Ordering : ClassTag, V](
           // Assume the input partitions are roughly balanced and over-sample 
a little bit.
           val sampleSizePerPartition = math.ceil(3.0 * sampleSize / 
rdd.partitions.length).toInt
           val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), 
sampleSizePerPartition)
    +      val numSampled = sketched.map(_._3.length).sum
           if (numItems == 0L) {
             Array.empty
           } else {
    +        // already got the whole data
    +        if (sampleCacheEnabled && numItems == numSampled) {
    +          // get the sampled data
    +          sampledArray = new Array[K](numSampled)
    +          var curPos = 0
    +          sketched.foreach(_._3.foreach(sampleRow => {
    --- End diff --
    
    We need the sample data to form a rdd for the next step.
    I can't think of a better idea  than merging  the sample data arrays from 
each partition to one array and parallelizing it to a rdd:
    ```
    sparkContext.parallelize(partitioner.getSampledArray.toSeq, 
rdd.getNumPartitions)
    ```




---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #21859: [SPARK-24900][SQL]Speed up sort when the dataset ...

Reply via email to