berkaysynnada commented on issue #9941: URL: https://github.com/apache/datafusion/issues/9941#issuecomment-2425815233
> How would this work? At this moment the entire left side is loaded in memory, basically performing a https://en.m.wikipedia.org/wiki/Block_nested_loop with all data loaded into memory. The left child plan can be executed in parallel. Not loading all of the left side seems to require scanning one side more than once?

Let's say the left side has 2 partitions (L1, L2), the right side has 3 (R1, R2, R3), and the target partition count is 6 (T1, T2, T3, T4, T5, T6). `execute()` of CrossJoin is called once for each of those 6 partitions, and each call reads the corresponding pair of input partitions:

- T1: L1-R1
- T2: L1-R2
- T3: L1-R3
- T4: L2-R1
- T5: L2-R2
- T6: L2-R3

I don't think we need multiple scans of the same source. The first consumer of an input partition can scan it and share the data with the other consumers.

That said, I am not sure this approach would bring significant performance improvements, but I saw the TODO item and thought it would be interesting to discuss with you.
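To make the mapping concrete, here is a minimal Rust sketch of the idea, not DataFusion's actual CrossJoin code. The names `input_pair`, `SharedPartition`, and `get_or_scan` are hypothetical, and rows are modelled as a plain `Vec<i64>` instead of record batch streams. It assumes output partitions are numbered row-major over (left, right) pairs, and that whichever output partition reaches an input partition first performs the scan while the others reuse the buffered result.

```rust
use std::sync::{Arc, OnceLock};

/// Map a zero-based output partition index to the zero-based
/// (left, right) input partition pair it reads, row-major over the pairs.
/// Returns `None` when the index is outside `left_parts * right_parts`.
fn input_pair(target: usize, left_parts: usize, right_parts: usize) -> Option<(usize, usize)> {
    if right_parts == 0 || target >= left_parts * right_parts {
        return None;
    }
    Some((target / right_parts, target % right_parts))
}

/// Hypothetical per-input-partition cache: the first caller runs the scan,
/// every later caller gets a cheap handle to the same buffered rows.
struct SharedPartition {
    rows: OnceLock<Arc<Vec<i64>>>,
}

impl SharedPartition {
    fn new() -> Self {
        Self { rows: OnceLock::new() }
    }

    /// Run `scan` only if no caller has filled the cache yet.
    fn get_or_scan<F: FnOnce() -> Vec<i64>>(&self, scan: F) -> Arc<Vec<i64>> {
        Arc::clone(self.rows.get_or_init(|| Arc::new(scan())))
    }
}
```

With 2 left and 3 right partitions, `input_pair(3, 2, 3)` yields `Some((1, 0))`, which corresponds to T4: L2-R1 in the list above. Note that sharing in this way still buffers each scanned partition in memory, so it avoids repeated scans but does not by itself reduce memory use.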
