2010YOUY01 opened a new issue, #15760: URL: https://github.com/apache/datafusion/issues/15760
### Is your feature request related to a problem or challenge? The common NLJ implementation consumes constant memory. However, DataFusion's implementation is optimized for execution time, which requires it to buffer all input data on one side, making it possible to fail under memory-constrained cases. The following pseudocode explains the current implementation: ## Normal NLJ ``` for l_batch in nlj.left.get_next_batch(): for r_batch in nlj.right.get_next_batch(): let matched_entries = match(l_batch, r_batch); return matched_entries nlj.right.reset(); // Let right side of the join restart and scan again ``` ## DataFusion's NLJ implementation ``` for l_batch in nlj.left.get_next_batch(): buffered_left_batches.push(l_batch); for r_batch in nlj.right.get_next_batch(): let matched_entries = match(buffered_left_batches, r_batch); return matched_batches; ``` Related code: https://github.com/apache/datafusion/blob/128217081ca922c404fe423c3c1b945662d53c8a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L499 ### Describe the solution you'd like When the memory limit is reached when buffering the left side input, start probing the right side. After it's done, collect the remaining left side entries, and let the right side scan from the beginning and probe again, until finished. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org