c21 edited a comment on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-826493892
Skew join handling destroys the output partitioning, whereas this approach keeps it. [Skew join handling is currently not enabled if it introduces an extra shuffle in the plan](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala#L279). I agree that the change in AQE for skew join handling is more incremental and less intrusive. That said, I don't see any major intrusive API change in this PR either. I am just brainstorming the pros and cons, and I think we should pick the direction that leads towards the eventual goal: enabling shuffled hash join by default. @cloud-fan - as you mentioned earlier, I agree that (1) run-time sort-based fallback in shuffled hash join itself and (2) AQE skew join handling / hybrid join features are orthogonal to each other. AQE is great for covering a lot of cases, but as we all know it has some limitations (I listed some above and here).
