wbo4958 commented on code in PR #37855:
URL: https://github.com/apache/spark/pull/37855#discussion_r969514082


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala:
##########
@@ -298,8 +297,7 @@ object ShuffleExchangeExec {
     }
     def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match 
{
       case RoundRobinPartitioning(numPartitions) =>
-        // Distributes elements evenly across output partitions, starting from 
a random partition.

Review Comment:
   @HyukjinKwon Thx for reviewing.
   
   The original comment "starting from a random partition", I think please 
correct me if I am wrong, is meaning "from which reducer partition beginning to 
do the shuffle write with RoundRobin manner. Basically, the **big** data should 
be distributed evenly in the same partition. But the issue here is the shuffle 
partition does not contain much data, the data actually is smaller than the 
total reducer partition, which means if the starting position is the same for 
all the shuffle partitions, then all the data will be distributed into the same 
reducer partitions for all the shuffle partitions, and some reducer partitions 
will not have any data.
   
   This PR just makes the partitionId the default starting position to do the 
RoundRobin, which means each shuffle partition has a different starting 
position,
   
   I tested the below code
   
   ```  scala
         val df = spark.range(0, 100, 1, 50).repartition(4)
         val v = df.rdd.mapPartitions { iter => {
           Iterator.single(iter.length)
         }
         }.collect()
         println(v.mkString(","))
   ```
   
   w/ my PR, It outputs `24,25,26,25`, w/ o my PR, it outputs `50,0,0,50`
   
   Similarly, if I change to repartition(8)
   
   w/ my PR, It outputs `12,13,14,13,12,12,12,12`, w/ o my PR, it outputs 
`0,0,0,0,0,0,50,50`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to