Dandandan opened a new pull request #8961:
URL: https://github.com/apache/arrow/pull/8961
This PR uses the `num_rows` statistics to implement a common optimization to
use the smallest table for the build phase.
This is a good heuristic, as there are less items to insert to the hash
table and the size of tables can be very imbalanced.
Some notes:
* The optimization works on the `LogicalPlan` by swapping left and right,
the join type and the key order. This seems currently the easiest place to add
it, as there is no cost based optimizer and/or optimizers on the physical plan
yet. The optimization rule assumes that the left part of the join will be used
for the build phase and the right part for the probe phase.
* It requires the number of rows to be exactly known, so it will not work
whenever there is a transformation changing the number of rows, except for
`limit`. The idea here is that in other cases, it is very hard to estimate the
number of resulting rows.
* The impact currently is negative, as the hash join implementation seems to
currently be slower when the right side of the join is bigger. That seems to
strange and unexpected, but it seems better to disable this optimization until
that is "fixed".
FYI @andygrove @jorgecarleitao
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]