[GitHub] [spark] cloud-fan commented on a diff in pull request #37855: [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition

GitBox Tue, 20 Sep 2022 20:09:39 -0700


cloud-fan commented on code in PR #37855:
URL: https://github.com/apache/spark/pull/37855#discussion_r975993118



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala:
##########
@@ -299,7 +300,8 @@ object ShuffleExchangeExec {
     def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match {
       case RoundRobinPartitioning(numPartitions) =>
         // Distributes elements evenly across output partitions, starting from a random partition.
-        var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions)

Review Comment:
   OK, I tried `(1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(32) + " "))` and the result is very counterintuitive: a small change in the seed hardly changes the first random value.
   
   Can we add a comment explaining why we need `hashing.byteswap32`?
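
The experiment above can be reproduced with a minimal sketch like the one below. It assumes the fix seeds the generator with `hashing.byteswap32(partitionId)`; the replacement line is not shown in this hunk, so treat `mixedStart` as an illustration, not the PR's exact code. The names `SeedDemo`, `rawStart`, `mixedStart`, and `distinctStarts` are hypothetical. The likely cause of the clustering is that `java.util.Random` is a linear congruential generator, and `nextInt` with a power-of-two bound returns the high bits of the first step, which move very little when only the low bits of the seed differ.

```scala
import scala.util.Random
import scala.util.hashing

object SeedDemo {
  // First draw from a generator seeded directly with the partition id,
  // as in the removed line of the diff.
  def rawStart(partitionId: Int, numPartitions: Int): Int =
    new Random(partitionId).nextInt(numPartitions)

  // Same draw, but the seed is mixed with byteswap32 first
  // (assumed shape of the fix; not taken from the hunk).
  def mixedStart(partitionId: Int, numPartitions: Int): Int =
    new Random(hashing.byteswap32(partitionId)).nextInt(numPartitions)

  // Number of distinct starting positions across partition ids 1 to 200.
  def distinctStarts(f: (Int, Int) => Int, numPartitions: Int): Int =
    (1 to 200).map(f(_, numPartitions)).distinct.size

  def main(args: Array[String]): Unit = {
    println(s"raw seeds:   ${distinctStarts(rawStart, 32)} distinct starting positions")
    println(s"mixed seeds: ${distinctStarts(mixedStart, 32)} distinct starting positions")
  }
}
```

If the raw-seed count comes out far below the mixed-seed count, most tasks start their round-robin at the same output partition, which is the skew this PR is about.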



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

