[GitHub] [spark] cloud-fan commented on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

GitBox Tue, 20 Apr 2021 08:15:47 -0700


cloud-fan commented on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-823357554



   I'm a bit worried about this solution:
   1. Sorting the stream side at runtime may slow the query down, because 
the sort is not whole-stage-codegen-ed.
   2. Unlike SMJ, the output ordering can't be preserved if we sort the 
stream side at runtime.
   
   I think the eventual goal is to enable shuffle hash join by default, but I'm 
not sure adding the fallback can achieve this goal. Do you have some real data 
to show the benefits?
   
   Another idea is to pick shuffle hash join in AQE when we know the 
per-partition size after shuffle.
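
   A minimal sketch of that AQE idea, not taken from the PR: once the 
shuffle has run and the per-partition sizes of the build side are known, 
choose shuffled hash join only if every partition is small enough to build 
a hash map for, and keep sort-merge join otherwise. The function name, the 
threshold parameter, and the strategy strings are hypothetical; this is 
plain Python modelling the decision, not Spark's planner API.

```python
from typing import Sequence


def pick_join_strategy(partition_sizes_bytes: Sequence[int],
                       max_hash_build_bytes: int) -> str:
    """Pick a join strategy from post-shuffle partition sizes.

    partition_sizes_bytes: build-side size of each shuffle partition
        (hypothetical input, as would be known after the shuffle stage).
    max_hash_build_bytes: assumed per-partition limit under which a hash
        map can be built safely in memory.
    """
    # Hash join is only chosen when *every* partition fits the limit;
    # a single oversized partition keeps the plan on sort-merge join.
    if all(size <= max_hash_build_bytes for size in partition_sizes_bytes):
        return "shuffled_hash_join"
    return "sort_merge_join"


if __name__ == "__main__":
    mb = 1024 * 1024
    print(pick_join_strategy([10 * mb, 20 * mb, 15 * mb], 64 * mb))
    print(pick_join_strategy([10 * mb, 500 * mb, 15 * mb], 64 * mb))
```

   Deciding per query from observed sizes avoids the need for a runtime 
fallback, since the oversized case never picks hash join in the first place.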


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



