[GitHub] [arrow-datafusion] xudong963 opened a new pull request #1831: determine build side in hash join by total_byte_size instead of num_rows

GitBox Mon, 14 Feb 2022 06:27:24 -0800


xudong963 opened a new pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831



   # Which issue does this PR close?
   
   No
   
   # Rationale for this change
   
   The memory available to a hash join is limited, and users can cap it with 
MemoryManager (introduced by @yjshen) according to their needs. If the entire 
build side fits in memory, the join is very efficient; otherwise we need a 
fallback such as spilling to disk.
   
   Currently, DF uses each table's `num_rows` to determine the build side of 
the hash join. This is not a reliable measure. The usual advice is to pick the 
smaller table as the build side, but **smaller** here means the table's total 
size in bytes: a table with fewer rows is not necessarily smaller overall, for 
example when its rows are much wider.
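   
   To illustrate the difference, here is a minimal sketch, not the code in 
this PR. `TableStats`, `BuildSide`, `build_side_by_rows` and 
`build_side_by_bytes` are hypothetical names made up for this example; only the 
`num_rows` and `total_byte_size` field names come from the description above. 
The sketch also assumes that when a statistic is missing we simply keep the 
left input as the build side.

```rust
/// Simplified stand-in for per-table statistics (hypothetical type,
/// for illustration only; not the DataFusion struct).
#[derive(Debug, Clone, Copy)]
struct TableStats {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Which join input is loaded into the in-memory hash table.
#[derive(Debug, PartialEq)]
enum BuildSide {
    Left,
    Right,
}

/// Old heuristic: pick the input with fewer rows.
/// Assumption: keep the left side if either statistic is unknown.
fn build_side_by_rows(left: &TableStats, right: &TableStats) -> BuildSide {
    match (left.num_rows, right.num_rows) {
        (Some(l), Some(r)) if r < l => BuildSide::Right,
        _ => BuildSide::Left,
    }
}

/// Proposed heuristic: pick the input with fewer total bytes.
/// Assumption: keep the left side if either statistic is unknown.
fn build_side_by_bytes(left: &TableStats, right: &TableStats) -> BuildSide {
    match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) if r < l => BuildSide::Right,
        _ => BuildSide::Left,
    }
}
```

   With a left table of 1,000 wide rows (8,000,000 bytes) and a right table of 
10,000 narrow rows (80,000 bytes), the row-count heuristic keeps the left table 
as the build side, while the byte-size heuristic picks the right one, which 
needs far less memory.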
   
   # What changes are included in this PR?
   
   Determine the build side in a hash join by `total_byte_size` instead of 
`num_rows`.
   
   # Are there any user-facing changes?
   
   No
   


