cloud-fan commented on PR #44133: URL: https://github.com/apache/spark/pull/44133#issuecomment-1914014276
It's really unfortunate that the shuffle hash function is coupled with the specific integral type, which makes the join keys sensitive to `CAST`. However, to stay compatible with existing bucketed tables, it's too late to change the shuffle hash function now :(

This now becomes a complicated cost problem: it's not only about the number of shuffles, but also about the final parallelism. It can also affect data skew: if the join predicate is `int = int`, then all the long values larger than `Int.MaxValue` go to the same partition. If the join predicate is `long = long`, the data distribution is more even, since long values larger than `Int.MaxValue` are spread across partitions rather than collapsing into one.
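To make the skew concrete, here is a minimal local sketch. It is not from this PR: the table name, key range, and partition count are made up, and `TRY_CAST` is used as an assumption to model the case where overflowing longs collapse to a single value (`NULL`), which all hash to the same partition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical repro: hash-partition the same long keys as LONG vs. as INT
// and compare the per-partition row counts.
object JoinKeyCastSkew {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("cast-skew")
      .getOrCreate()
    import spark.implicits._

    // One million long keys, all greater than Int.MaxValue.
    val big = spark.range(1, 1000001).selectExpr("id + 2147483647L AS k")

    // Hash-partition on `k` into 8 partitions and report rows per partition.
    def partitionSizes(df: DataFrame): Seq[Int] =
      df.repartition(8, $"k").mapPartitions(it => Iterator(it.size)).collect().toSeq

    // Keys kept as LONG: distinct values hash apart, roughly even partitions.
    println(s"long keys: ${partitionSizes(big)}")

    // Keys down-cast to INT: every key overflows to NULL under TRY_CAST,
    // so all rows land in a single partition.
    println(s"int keys:  ${partitionSizes(big.selectExpr("TRY_CAST(k AS INT) AS k"))}")
  }
}
```

Under these assumptions, the first run prints eight counts of roughly equal size while the second puts every row in one partition, which is the skew the cost model would have to account for.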
