Re: [I] [DISCUSSION] Sort Merge Join Experimental status [arrow-datafusion]

via GitHub Wed, 03 Apr 2024 06:13:59 -0700


alamb commented on issue #9846:
URL: 
https://github.com/apache/arrow-datafusion/issues/9846#issuecomment-2034578501


   > Is there a rule of thumb for choosing SMJ over HJ?
   
   I believe the current state of the art in query processing is:
   1. If the data is already sorted by the join keys, use MergeJoin (as 
@Dandandan says)
   2. If the data is not already sorted on the join keys, use HashJoin
   3. If HashJoin runs out of memory while building the hash table, spill the 
table to disk (possibly switching to merge join internally)
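
   The three rules above can be written down as a small decision function. 
This is only a sketch of the rule of thumb, not DataFusion's planner: the names 
`JoinStrategy`, `choose_join_strategy`, `build_side_bytes` and 
`memory_budget_bytes` are hypothetical. It also assumes the build-side size is 
known up front, whereas in practice rule 3 would be triggered at runtime when 
the hash table no longer fits.

```python
from enum import Enum


class JoinStrategy(Enum):
    """Hypothetical labels for the three outcomes of the rule of thumb."""
    MERGE_JOIN = "merge_join"
    HASH_JOIN = "hash_join"
    SPILLING_HASH_JOIN = "spilling_hash_join"


def choose_join_strategy(inputs_sorted_on_keys: bool,
                         build_side_bytes: int,
                         memory_budget_bytes: int) -> JoinStrategy:
    """Pick a join strategy following the three rules in the list above.

    All parameters are illustrative assumptions, not DataFusion APIs.
    """
    # Rule 1: inputs already sorted on the join keys -> merge join.
    if inputs_sorted_on_keys:
        return JoinStrategy.MERGE_JOIN
    # Rule 2: unsorted inputs whose hash table fits in memory -> hash join.
    if build_side_bytes <= memory_budget_bytes:
        return JoinStrategy.HASH_JOIN
    # Rule 3: the hash table does not fit -> hash join that spills to disk.
    return JoinStrategy.SPILLING_HASH_JOIN
```

   For example, `choose_join_strategy(False, 8 << 30, 1 << 30)` falls through 
to the spilling case because the assumed 8 GiB build side exceeds the assumed 
1 GiB budget.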
   
   The only benefit SMJ has over HJ at the moment in DataFusion is that we 
could plausibly join relations that are larger than memory using SMJ (because 
we can spill the inputs) -- this may be what @Dandandan is saying in 
https://github.com/apache/arrow-datafusion/issues/9846#issuecomment-2034369728
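
   To illustrate why sorted inputs make a larger-than-memory join plausible, 
here is a minimal sketch of an inner merge join over two streams of 
`(key, value)` pairs already sorted by key. It is not DataFusion's 
implementation, and `merge_join` is a hypothetical name. The point is that it 
only buffers one group of equal keys from the right side at a time, so the 
inputs can be read incrementally (for example from spilled, sorted runs, which 
are not shown here).

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner equi-join of two iterables of (key, value) pairs.

    Assumes both inputs are sorted by key and that keys are comparable
    and never None. Yields (key, left_value, right_value) tuples.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    done = (None, None)  # sentinel returned when a side is exhausted
    lkey, lrows = next(left_groups, done)
    rkey, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # Left key has no match yet: advance the left side.
            lkey, lrows = next(left_groups, done)
        elif lkey > rkey:
            # Right key has no match yet: advance the right side.
            rkey, rrows = next(right_groups, done)
        else:
            # Keys match: buffer only this key's right rows, then emit
            # the cross product with the left rows for the same key.
            right_buf = [value for _, value in rrows]
            for _, left_value in lrows:
                for right_value in right_buf:
                    yield (lkey, left_value, right_value)
            lkey, lrows = next(left_groups, done)
            rkey, rrows = next(right_groups, done)
```

   A hash join, by contrast, has to hold the whole build side in its hash 
table before it can probe, which is what rule 3 above has to work around.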
   
   I think it is close to impossible to make MJ beat HJ on performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
