alamb opened a new issue #141:
URL: https://github.com/apache/arrow-datafusion/issues/141


   *Note*: migrated from original JIRA: 
https://issues.apache.org/jira/browse/ARROW-11094
   
   The current hash join works well when one side of the join can be loaded 
into memory but cannot scale beyond the available RAM.
   
   The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the 
left and right partitions, and write the intermediate results to disk, and then 
stream both sides of the join by merging these sorted partitions and we do not 
need to load one side into memory. At most, we need to load all batches from 
both sides that contain the current join key values.
   
   In order to reduce memory pressure we will want to limit the concurrency of 
these sort operations.
   
   We would still want to default to hash join when we know that the build-side 
can fit into memory since it is more efficient than using a sort-merge join.
   
   [https://en.wikipedia.org/wiki/Sort-merge_join]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to