alamb commented on issue #17267:
URL: https://github.com/apache/datafusion/issues/17267#issuecomment-3211526631

   TLDR is I agree that the [hybrid hash 
join](https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join) is basically 
state of the art  for hash join implementations and I think would be a great 
end goal
   
   > I recommend to get sort merge join working reliably before experimenting 
HJ spilling (i.e. benchmarks should be able to finish under a modest memory 
limit, perhaps also more tests), 
   
   I think this is a really wise suggestion from @2010YOUY01 
   
   > Yes I think it would be a good idea to work on a POC for this and the SMJ 
improvements in parallel. Which parts of SMJ has problems currently? nothing 
seems to stand out to me. cc [@comphead](https://github.com/comphead)
   
   One major limitation today is that the the user has to pick between HashJoin 
and MergeJoin at *Plan* time. 
   
   From the [config 
settings](https://datafusion.apache.org/user-guide/configs.html)
   
   datafusion.optimizer.prefer_hash_join | true | When set to true, the 
physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can 
work more efficiently than SortMergeJoin but consumes more memory
   -- | -- | --
   
   
   
   ## Proposed Course of Action
   
   1. Phase 1: Implement HashJoin spilling as a "runtime switch to MergeJoin" 
-- basically dump the hash table to spill files and switch to using MergeJoin 
for the rest of the query
   2. Phase 2: Implement the more sophisticated HybridHashJoin
   
   Rationale:
   1. As @2010YOUY01  says, I think the runtime switch to MergeJoin is simpler 
and can build on our work with the spilled sorts.
   2. While we implement the simpler case, we will gain experience (and write 
tests) the HashJoin code
   3. The major downside of the "runtime switch to MergeJoin" approach is that 
it has a "performance cliff" -- if the operator switches from Hash to MergeJoin 
it will be much slower
   
   The major benefit of a Hybrid Hash Join type approach is that it degrades 
more gracefully (as you only need to spill some partitions, for example). While 
an important feature, I think we should first focus on basic functionality
   
   I have some more thoughts about the hybrid hash join approach, but I don't 
want to confuse this conversation now
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to