alamb commented on issue #17267: URL: https://github.com/apache/datafusion/issues/17267#issuecomment-3211526631
TLDR is I agree that the [hybrid hash join](https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join) is basically state of the art for hash join implementations and I think would be a great end goal

> I recommend to get sort merge join working reliably before experimenting HJ spilling (i.e. benchmarks should be able to finish under a modest memory limit, perhaps also more tests),

I think this is a really wise suggestion from @2010YOUY01

> Yes I think it would be a good idea to work on a POC for this and the SMJ improvements in parallel. Which parts of SMJ has problems currently? nothing seems to stand out to me. cc [@comphead](https://github.com/comphead)

One major limitation today is that the user has to pick between HashJoin and MergeJoin at *Plan* time. From the [config settings](https://datafusion.apache.org/user-guide/configs.html):

key | default | description
-- | -- | --
datafusion.optimizer.prefer_hash_join | true | When set to true, the physical plan optimizer will prefer HashJoin over SortMergeJoin. HashJoin can work more efficiently than SortMergeJoin but consumes more memory

## Proposed Course of Action

1. Phase 1: Implement HashJoin spilling as a "runtime switch to MergeJoin" -- basically dump the hash table to spill files and switch to using MergeJoin for the rest of the query
2. Phase 2: Implement the more sophisticated HybridHashJoin

Rationale:
1. As @2010YOUY01 says, I think the runtime switch to MergeJoin is simpler and can build on our work with the spilled sorts.
2. While we implement the simpler case, we will gain experience with (and write tests for) the HashJoin code
3. The major downside of the "runtime switch to MergeJoin" approach is that it has a "performance cliff" -- if the operator switches from Hash to MergeJoin it will be much slower

The major benefit of a Hybrid Hash Join type approach is that it degrades more gracefully (as you only need to spill some partitions, for example).
While an important feature, I think we should first focus on basic functionality.

I have some more thoughts about the hybrid hash join approach, but I don't want to confuse this conversation now
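
To make the Phase 1 "runtime switch to MergeJoin" idea concrete, here is a minimal sketch in plain Rust. It does not use any DataFusion APIs: `Row`, `Strategy`, `join_with_fallback`, `sort_merge_join`, and the `max_build_rows` budget are all hypothetical names invented for illustration, the row-count budget is only a stand-in for a real memory limit, and the "spill" is an in-memory `Vec` rather than spill files.

```rust
use std::collections::HashMap;

/// Illustrative row: (join key, payload). Not a DataFusion type.
type Row = (i64, String);

/// Which strategy produced the output (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Strategy {
    Hash,
    SortMerge,
}

/// Plain sort-merge inner join on the key.
fn sort_merge_join(mut left: Vec<Row>, mut right: Vec<Row>) -> Vec<(i64, String, String)> {
    left.sort_by_key(|r| r.0);
    right.sort_by_key(|r| r.0);
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        let (lk, rk) = (left[i].0, right[j].0);
        if lk < rk {
            i += 1;
        } else if lk > rk {
            j += 1;
        } else {
            // Find the run of equal keys on each side and emit their cross product.
            let i_end = i + left[i..].iter().take_while(|r| r.0 == lk).count();
            let j_end = j + right[j..].iter().take_while(|r| r.0 == lk).count();
            for l in &left[i..i_end] {
                for r in &right[j..j_end] {
                    out.push((lk, l.1.clone(), r.1.clone()));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}

/// Inner equi-join that starts as a hash join and switches to sort-merge
/// once the build side holds `max_build_rows` rows and more input arrives.
fn join_with_fallback(
    build: Vec<Row>,
    probe: Vec<Row>,
    max_build_rows: usize,
) -> (Strategy, Vec<(i64, String, String)>) {
    let mut table: HashMap<i64, Vec<String>> = HashMap::new();
    let mut rows_in_table = 0usize;
    let mut build_iter = build.into_iter();

    while let Some((key, payload)) = build_iter.next() {
        if rows_in_table == max_build_rows {
            // Budget exhausted: "dump" the hash table (here: into a Vec that
            // stands in for spill files), add the unread build rows, and let
            // the merge join handle the rest of the query.
            let mut spilled: Vec<Row> = table
                .into_iter()
                .flat_map(|(k, vs)| vs.into_iter().map(move |v| (k, v)))
                .collect();
            spilled.push((key, payload));
            spilled.extend(build_iter);
            return (Strategy::SortMerge, sort_merge_join(spilled, probe));
        }
        table.entry(key).or_default().push(payload);
        rows_in_table += 1;
    }

    // Build side fit in the budget: ordinary hash probe.
    let mut out = Vec::new();
    for (key, right) in probe {
        if let Some(lefts) = table.get(&key) {
            for left in lefts {
                out.push((key, left.clone(), right.clone()));
            }
        }
    }
    (Strategy::Hash, out)
}
```

Calling `join_with_fallback(build, probe, 1)` on a three-row build side takes the sort-merge path and returns the same matches (possibly in a different order) as the hash path with a larger budget. The sketch also shows where the "performance cliff" comes from: once the budget is hit, every row on both sides goes through a sort, even if the input was only slightly over the limit.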
