Re: [I] [Discussion] Should hash join order be based on memory size? [datafusion]

via GitHub Tue, 12 May 2026 07:52:21 -0700


alamb commented on issue #22098:
URL: https://github.com/apache/datafusion/issues/22098#issuecomment-4431721313


   > So what I'd like to discuss is if the hash join order should be decided 
based on a more complex heuristic. For example, "if the difference in size 
between the tables is less than X, go by row count, otherwise go by byte size". 
It appears that Postgres also does something like this:
   
   In general, picking the right join order is a complex and multi-faceted 
problem. It typically involves various heuristics, size and cardinality 
estimates, and many other things.
   
   I am personally very skeptical that we can add more advanced heuristics that 
don't make the plans worse for some people (aka they will experience it as a 
regression). So while this particular heuristic update looks reasonable, I worry 
about unintended consequences. This is very much driven by my experience 
working with the Vertica optimizer, where we had all sorts of challenges with 
complex join orders.
   
   I think a better approach is to make the heuristic more tunable / pluggable 
so people can plug in whatever heuristics they want. There is more backstory here:
   - https://github.com/apache/datafusion/issues/17718
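
To make the "pluggable" idea concrete, here is a minimal sketch of what such a hook might look like. Everything in it is hypothetical: `InputStats`, `BuildSide`, `BuildSideHeuristic`, and `RowsThenBytes` are illustrative names, not existing DataFusion APIs. It also assumes the quoted "difference in size ... less than X" means a byte-size ratio, which is only one possible reading.

```rust
/// Hypothetical per-input estimates (not DataFusion's `Statistics` type).
#[derive(Debug, Clone, Copy)]
struct InputStats {
    rows: u64,
    bytes: u64,
}

/// Which input the hash table would be built from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BuildSide {
    Left,
    Right,
}

/// Hypothetical plug-in point: users supply their own policy.
trait BuildSideHeuristic {
    fn choose(&self, left: InputStats, right: InputStats) -> BuildSide;
}

/// The quoted heuristic: if the byte sizes are within `max_byte_ratio`
/// of each other, compare row counts; otherwise compare byte sizes.
struct RowsThenBytes {
    max_byte_ratio: f64, // the "X" from the quote, assumed to be a ratio
}

impl BuildSideHeuristic for RowsThenBytes {
    fn choose(&self, left: InputStats, right: InputStats) -> BuildSide {
        let small = left.bytes.min(right.bytes) as f64;
        let large = left.bytes.max(right.bytes) as f64;
        let similar_size = large <= small * self.max_byte_ratio;

        // Build from whichever side is "smaller" under the chosen metric.
        let left_is_smaller = if similar_size {
            left.rows <= right.rows
        } else {
            left.bytes <= right.bytes
        };

        if left_is_smaller {
            BuildSide::Left
        } else {
            BuildSide::Right
        }
    }
}
```

With a trait like this, the default policy could stay simple while users who know their data swap in their own implementation, so a change in heuristic does not become a regression for everyone else.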
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
