[GitHub] spark pull request #22961: [SPARK-25947][SQL] Reduce memory usage in Shuffle...

mu5358271 Thu, 08 Nov 2018 13:42:56 -0800

Github user mu5358271 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22961#discussion_r232070793
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 ---
    @@ -214,13 +214,24 @@ object ShuffleExchangeExec {
               override def getPartition(key: Any): Int = key.asInstanceOf[Int]
             }
           case RangePartitioning(sortingExpressions, numPartitions) =>
    -        // Internally, RangePartitioner runs a job on the RDD that samples 
keys to compute
    -        // partition bounds. To get accurate samples, we need to copy the 
mutable keys.
    +        // Extract only fields used for sorting to avoid collecting large 
fields that does not
    +        // affect sorting result when deciding partition bounds in 
RangePartitioner
             val rddForSampling = rdd.mapPartitionsInternal { iter =>
    +          val projection =
    +            UnsafeProjection.create(sortingExpressions.map(_.child), 
outputAttributes)
               val mutablePair = new MutablePair[InternalRow, Null]()
    -          iter.map(row => mutablePair.update(row.copy(), null))
    +          // Internally, RangePartitioner runs a job on the RDD that 
samples keys to compute
    +          // partition bounds. To get accurate samples, we need to copy 
the mutable keys.
    +          iter.map(row => mutablePair.update(projection(row).copy(), null))
             }
    -        implicit val ordering = new 
LazilyGeneratedOrdering(sortingExpressions, outputAttributes)
    +        // Construct ordering on extracted sort key.
    +        val orderingAttributes =
    --- End diff --
    
    you are right. this is much better. changed



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22961: [SPARK-25947][SQL] Reduce memory usage in Shuffle...

Reply via email to