xudong963 opened a new pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831
# Which issue does this PR close?
No
# Rationale for this change
The memory size for hash join is limited, users can use MemoryManager
(introduced by @yjshen ) to limit memory by their demands. So if we can load
the entire build side into memory it will be very efficient, otherwise, we need
to do something else like spilling to disk.
Currently, DF uses tables' `num_rows` to determine the build side of the
hash join. It's not reasonable. In general, we say to select the smaller table
as build side in hash join, the **smaller** in fact means the total bytes size
in a table. (A smaller number of rows does not mean the entire table is
smaller).
# What changes are included in this PR?
determine build side in hash join by total_byte_size instead of num_rows.
# Are there any user-facing changes?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]