wbo4958 commented on code in PR #37855:
URL: https://github.com/apache/spark/pull/37855#discussion_r969514082
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala:
##########
@@ -298,8 +297,7 @@ object ShuffleExchangeExec {
}
def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match
{
case RoundRobinPartitioning(numPartitions) =>
- // Distributes elements evenly across output partitions, starting from
a random partition.
Review Comment:
@HyukjinKwon Thx for reviewing.
The original comment "starting from a random partition", I think please
correct me if I am wrong, is meaning "from which reducer partition beginning to
do the shuffle write with RoundRobin manner. Basically, the **big** data should
be distributed evenly in the same partition. But the issue here is the shuffle
partition does not contain much data, the data actually is smaller than the
total reducer partition, which means if the starting position is the same for
all the shuffle partitions, then all the data will be distributed into the same
reducer partitions for all the shuffle partitions, and some reducer partitions
will not have any data.
This PR just makes the partitionId the default starting position to do the
RoundRobin, which means each shuffle partition has a different starting
position,
I tested the below code
``` scala
val df = spark.range(0, 100, 1, 50).repartition(4)
val v = df.rdd.mapPartitions { iter => {
Iterator.single(iter.length)
}
}.collect()
println(v.mkString(","))
```
w/ my PR, It outputs `24,25,26,25`, w/ o my PR, it outputs `50,0,0,50`
Similarly, if I change to repartition(8)
w/ my PR, It outputs `12,13,14,13,12,12,12,12`, w/ o my PR, it outputs
`0,0,0,0,0,0,50,50`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]