c21 commented on pull request #32328: URL: https://github.com/apache/spark/pull/32328#issuecomment-827072227
> Yes, it's a side effect for OptimizeSkewedJoin, but smj's advantage is it could spill. IMO, if user specify the shuffled hash join to do execution that means they know the benefit and issue of it. And in the other hand, we can easily increase the memory but hard to make skew join fast. So this optimization can be the extra choice for user.

I agree this adds an extra choice for users given the current state of things. But in the long term, we would like to work towards enabling shuffled hash join by default (i.e. `spark.sql.join.preferSortMergeJoin`=false). To me, this seems to add more [risk](https://github.com/apache/spark/pull/32328#issuecomment-826601325) to that long-term direction. So I think we should be more cautious with it and have more discussion. cc @cloud-fan.
