UBarney commented on issue #15760:
URL: https://github.com/apache/datafusion/issues/15760#issuecomment-2821374117

   > If the right side is a table backed by a local file, the operator can scan 
it again without buffering or spilling. 
   
   I previously thought that 'backed by a local file' and 'spilling' were the 
same thing.
   
   I found that the current implementation loads the entire result of 
`join(all_left_table, current_right_batch)` into memory at once. In the worst 
case, that result contains `all_left_table.row_count() * 
current_right_batch.row_count()` rows.
   
   
https://github.com/apache/datafusion/blob/3fe300c1525d373e9d78c3830e0990204c854a6a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L885-L895
   
   I added some logging to confirm this:
   ```diff
   +                let batch1 = result?;
   +                dbg!(&batch1.num_rows());
   +
   +                self.batch_transformer.set_batch(batch1);
   ```
   
   Output:
   ```
   > select count(*) from (select * from  generate_series(10000,20000) as t1 
join generate_series(8194)  as t2 on t1.value != t2.value);
   
   [datafusion/physical-plan/src/joins/nested_loop_join.rs:900:17] 
&batch1.num_rows() = 30003
   [datafusion/physical-plan/src/joins/nested_loop_join.rs:900:17] 
&batch1.num_rows() = 81928192
   ```
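
Both row counts match the worst-case bound above. The sketch below redoes the arithmetic under two assumptions that are not stated in the log: that `generate_series(8194)` yields the values 0 through 8194, and that the right side arrives as one batch of 8192 rows followed by a 3-row remainder. No value in `t1` equals any value in `t2`, so the `!=` filter keeps every pair. The helper names are hypothetical.

```rust
/// Number of rows in an inclusive integer range, like generate_series(start, end).
fn series_len(start: i64, end: i64) -> u64 {
    (end - start + 1) as u64
}

/// Worst-case output rows when every left row is joined with one right batch.
fn worst_case_rows(left_rows: u64, right_batch_rows: u64) -> u64 {
    left_rows * right_batch_rows
}

fn main() {
    let left_rows = series_len(10000, 20000); // t1
    let right_rows = series_len(0, 8194); // t2, assuming the series starts at 0
    let batch_size: u64 = 8192; // assumed batch size

    // Assumed split of the right side: one full batch plus the remainder.
    let full_batch = batch_size.min(right_rows);
    let remainder = right_rows - full_batch;

    println!("full batch: {}", worst_case_rows(left_rows, full_batch));
    println!("remainder:  {}", worst_case_rows(left_rows, remainder));
}
```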
   
   To reduce peak memory usage, we could compute the join results in a 
streaming fashion, perhaps in a separate coroutine that sends them through a 
channel.
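
One way to bound the size of each output batch is to keep a cursor into the left and right inputs and emit a chunk as soon as it reaches a row limit. The following is a minimal sketch of that idea over plain `i64` slices. It is not the DataFusion implementation and uses none of its APIs; `ChunkedNlj`, `summarize`, and `max_rows` are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for a streaming nested loop join: it yields matching
/// pairs in chunks of at most `max_rows` instead of one large batch.
struct ChunkedNlj<'a, F: Fn(i64, i64) -> bool> {
    left: &'a [i64],
    right: &'a [i64],
    predicate: F,
    max_rows: usize,
    l: usize, // cursor into `left`
    r: usize, // cursor into `right`
}

impl<'a, F: Fn(i64, i64) -> bool> Iterator for ChunkedNlj<'a, F> {
    type Item = Vec<(i64, i64)>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut out = Vec::with_capacity(self.max_rows);
        while self.l < self.left.len() {
            while self.r < self.right.len() {
                let (lv, rv) = (self.left[self.l], self.right[self.r]);
                self.r += 1; // advance first so the cursor survives an early return
                if (self.predicate)(lv, rv) {
                    out.push((lv, rv));
                    if out.len() == self.max_rows {
                        return Some(out);
                    }
                }
            }
            self.r = 0;
            self.l += 1;
        }
        if out.is_empty() { None } else { Some(out) }
    }
}

/// Runs the join with a `!=` predicate and returns
/// (total output rows, largest chunk size, number of chunks).
fn summarize(left: &[i64], right: &[i64], max_rows: usize) -> (usize, usize, usize) {
    assert!(max_rows > 0, "max_rows must be positive");
    let join = ChunkedNlj {
        left,
        right,
        predicate: |l: i64, r: i64| l != r,
        max_rows,
        l: 0,
        r: 0,
    };
    let (mut total, mut largest, mut chunks) = (0, 0, 0);
    for chunk in join {
        total += chunk.len();
        largest = largest.max(chunk.len());
        chunks += 1;
    }
    (total, largest, chunks)
}

fn main() {
    // Scaled-down inputs so the sketch runs quickly.
    let left: Vec<i64> = (0..100).collect();
    let right: Vec<i64> = (0..50).collect();
    let (total, largest, chunks) = summarize(&left, &right, 1024);
    println!("total={total} largest_chunk={largest} chunks={chunks}");
}
```

With this shape, peak memory for the output is tied to `max_rows` rather than to the product of the input sizes. Moving the loop into a separate task that sends each chunk through a channel would be a variation on the same cursor logic.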


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

