cloud-fan commented on pull request #30829: URL: https://github.com/apache/spark/pull/30829#issuecomment-765497632
On second thought, I think adding an "unhandled" shuffle is a bit risky, but allowing the stage optimization phase to add new shuffles is too complicated. I'd like to revisit the idea of putting the skew join optimization rule in the stage preparation phase.

For the two points you gave:

1. I think it's not true now; the comment is stale. The classdoc of `OptimizeSkewedJoin` says `when this rule is enabled, it also coalesces non-skewed partitions like CoalesceShufflePartitions does.` So I don't think `OptimizeSkewedJoin` needs to be run after `CoalesceShufflePartitions`.
2. We can add some checks and only trigger `OptimizeSkewedJoin` if the related shuffle stages are all materialized.
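To make the check in point 2 concrete, here is a rough sketch on a toy plan model. The types `PlanNode`, `ShuffleStage`, `Join` and the helper `canOptimizeSkew` are hypothetical stand-ins for illustration only, not Spark's actual classes or the code in this PR; the real check would need to inspect the query stages under the join.

```scala
// Toy plan model: hypothetical stand-ins, NOT Spark's real classes.
sealed trait PlanNode { def children: Seq[PlanNode] }

// A shuffle stage that may or may not have finished running.
final case class ShuffleStage(id: Int, materialized: Boolean) extends PlanNode {
  val children: Seq[PlanNode] = Nil
}

// A join over two inputs.
final case class Join(left: PlanNode, right: PlanNode) extends PlanNode {
  val children: Seq[PlanNode] = Seq(left, right)
}

object SkewJoinGuard {
  // Collect every shuffle stage reachable from the given node.
  def shuffleStages(plan: PlanNode): Seq[ShuffleStage] = plan match {
    case s: ShuffleStage => Seq(s)
    case other           => other.children.flatMap(shuffleStages)
  }

  // Allow the skew optimization only when the join has shuffle inputs
  // and every one of them is already materialized (assumed condition).
  def canOptimizeSkew(join: Join): Boolean = {
    val stages = shuffleStages(join)
    stages.nonEmpty && stages.forall(_.materialized)
  }
}
```

With this guard, a rule running in the stage preparation phase would simply skip the join until both sides have finished, so it never has to reason about partially known shuffle statistics.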
