[I] Memory limited nest loop join [datafusion]

via GitHub Fri, 18 Apr 2025 02:34:51 -0700


2010YOUY01 opened a new issue, #15760:
URL: https://github.com/apache/datafusion/issues/15760


   ### Is your feature request related to a problem or challenge?
   
   The common NLJ implementation consumes constant memory. However, 
DataFusion's implementation is optimized for execution time, which requires it 
to buffer all input data on one side, making it possible to fail under 
memory-constrained cases.
   
   The following pseudocode explains the current implementation:
   
   ## Normal NLJ
   ```
   for l_batch in nlj.left.get_next_batch():
       for r_batch in nlj.right.get_next_batch():
           let matched_entries = match(l_batch, r_batch);
           return matched_entries
       nlj.right.reset(); // Let right side of the join restart and scan again
   ```
   ## DataFusion's NLJ implementation
   ```
   for l_batch in nlj.left.get_next_batch():
       buffered_left_batches.push(l_batch);
   
   for r_batch in nlj.right.get_next_batch():
       let matched_entries = match(buffered_left_batches, r_batch);
       return matched_batches;
   ```
   
   Related code: 
https://github.com/apache/datafusion/blob/128217081ca922c404fe423c3c1b945662d53c8a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L499
   
   
   
   ### Describe the solution you'd like
   
   When the memory limit is reached when buffering the left side input, start 
probing the right side. After it's done, collect the remaining left side 
entries, and let the right side scan from the beginning and probe again, until 
finished.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Memory limited nest loop join [datafusion]

Reply via email to