cloud-fan commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-823357554
I'm a bit worried about this solution:
1. Sorting the stream-side at runtime may lead to a slow query plan, because the sort is not whole-stage-codegen-ed.
2. Unlike SMJ, the output ordering can't be preserved if we sort the stream-side at runtime.

I think the eventual goal is to enable shuffle hash join by default, but I'm not sure adding the fallback can achieve this goal. Do you have some real data to show the benefits?

Another idea is to pick shuffle hash join in AQE when we know the per-partition size after shuffle.
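
To make the AQE idea concrete, here is a rough sketch of the decision it implies: once the per-partition sizes after shuffle are known, choose shuffle hash join only if every build-side partition is small enough to hash. This is not Spark code; the function name, the `max_hashable_bytes` threshold, and the string labels are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def pick_join_strategy(build_partition_bytes: Sequence[int],
                       max_hashable_bytes: int) -> str:
    """Hypothetical helper: choose a join strategy from post-shuffle sizes.

    build_partition_bytes: observed size of each build-side partition
        after the shuffle (assumed to be available at re-planning time).
    max_hashable_bytes: assumed per-partition limit under which building
        an in-memory hash table is considered safe.
    """
    # Hash join is only chosen when the largest partition fits the limit;
    # with no size information we conservatively keep sort merge join.
    if build_partition_bytes and max(build_partition_bytes) <= max_hashable_bytes:
        return "shuffle_hash_join"
    return "sort_merge_join"
```

Because the decision uses sizes measured at runtime, a single oversized partition is enough to keep sort merge join, which avoids needing a runtime sort fallback inside the hash join itself.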
