Dandandan opened a new pull request #8961:
URL: https://github.com/apache/arrow/pull/8961


   This PR uses the `num_rows` statistics to implement a common optimization to 
use the smallest table for the build phase.
   This is a good heuristic, as there are less items to insert to the hash 
table and the size of tables can be very imbalanced.
   
   Some notes:
   
   * The optimization works on the `LogicalPlan` by swapping left and right, 
the join type and the key order. This seems currently the easiest place to add 
it, as there is no cost based optimizer and/or optimizers on the physical plan 
yet. The optimization rule assumes that the left part of the join will be used 
for the build phase and the right part for the probe phase.
   * It requires the number of rows to be exactly known, so it will not work 
whenever there is a transformation changing the number of rows, except for 
`limit`. The idea here is that in other cases, it is very hard to estimate the 
number of resulting rows.
   * The impact currently is negative, as the hash join implementation seems to 
currently be slower when the right side of the join is bigger. That seems to 
strange and unexpected, but it seems better to disable this optimization until 
that is "fixed".
   
    FYI @andygrove @jorgecarleitao 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to