cloud-fan commented on PR #44133:
URL: https://github.com/apache/spark/pull/44133#issuecomment-1914014276

   It's really unfortunate that the shuffle hash function is coupled with the specific integral 
type, which makes the join keys sensitive to CAST. However, to stay compatible with 
existing bucketed tables, it's too late to change the shuffle hash function now 
:(
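
   A minimal sketch of that type sensitivity, assuming a local SparkSession (the object name is hypothetical): Spark's `hash` function applies Murmur3 to the binary representation of the value, so the same numeric value hashes differently as INT and as BIGINT, which is why a CAST on a join key changes its shuffle partition.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{hash, lit}

   object HashTypeSensitivity {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("hash-type-sensitivity")
         .master("local[*]")
         .getOrCreate()

       // Murmur3 hashes an INT (4 bytes) and a BIGINT (8 bytes) of the
       // same numeric value differently, so casting a join key changes
       // its hash and therefore its shuffle partition.
       spark.range(1)
         .select(
           hash(lit(42)).as("hash_int"),    // 42 as INT
           hash(lit(42L)).as("hash_long"))  // 42 as BIGINT
         .show()
       // The two columns print different values, illustrating why an
       // int side and a long side can't reuse each other's partitioning.

       spark.stop()
     }
   }
   ```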
   
   Now this becomes a complicated cost problem: it's not only about the number 
of shuffles, but also about the final parallelism. It can also affect data skew: 
if the join predicate is int = int, then all the long values larger than 
Int.Max will go to the same partition. If the join predicate is long = long, the 
data distribution will be more even, as long values larger than 
Int.Max won't all land in the same partition.
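
   To make the skew concrete, here is a sketch that approximates each key's shuffle partition as `pmod(hash(key), numParts)`, mirroring what HashPartitioning computes. The `numParts = 200` value (the default `spark.sql.shuffle.partitions`) and the null-collapsing `asInt` mapping are assumptions used to model the int = int case described above, not Spark's actual cast behavior:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   object SkewSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").getOrCreate()
       val numParts = 200  // assumed: default spark.sql.shuffle.partitions

       // 999 long keys, all above Int.MaxValue.
       val df = spark.range(Int.MaxValue.toLong + 1L, Int.MaxValue.toLong + 1000L)

       // long = long: keys are hashed as 8-byte longs and spread out
       // across many shuffle partitions.
       df.select(pmod(hash(col("id")), lit(numParts)).as("p"))
         .agg(countDistinct("p").as("partitions_used")).show()

       // int = int (modeled): values outside the int range collapse to
       // NULL, and all NULL keys hash identically, landing in one
       // partition. No otherwise(...) means out-of-range => NULL.
       val asInt = when(
         col("id").between(Int.MinValue.toLong, Int.MaxValue.toLong),
         col("id").cast("int"))
       df.select(pmod(hash(asInt), lit(numParts)).as("p"))
         .agg(countDistinct("p").as("partitions_used")).show()  // prints 1

       spark.stop()
     }
   }
   ```
   
   The first aggregation reports many distinct partitions, the second exactly one, matching the skew contrast described above under the stated modeling assumption.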


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

