Re: [I] CrossJoin Implementation on (M x N) Partitions [datafusion]

via GitHub Mon, 21 Oct 2024 00:15:23 -0700


berkaysynnada commented on issue #9941:
URL: https://github.com/apache/datafusion/issues/9941#issuecomment-2425815233


   > How would this work? At the moment the entire left side is loaded into 
memory, which is basically a block nested loop join 
(https://en.m.wikipedia.org/wiki/Block_nested_loop) with all data loaded into 
memory. The left child plan can be executed in parallel. Not loading all of the 
left side seems to require scanning one side more than once?
   
   Let's say the left side has 2 partitions (L1, L2), the right side has 3 
(R1, R2, R3), and the target partition count is 6 (T1, T2, T3, T4, T5, T6). 
`execute()` of CrossJoin is called once for each of those 6 partitions, and 
each call reads the corresponding pair of input partitions: T1: L1-R1, 
T2: L1-R2, T3: L1-R3, T4: L2-R1, T5: L2-R2, T6: L2-R3.
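
   The pairing above can be sketched as a small index calculation. This is 
only an illustration, not DataFusion code: the function name 
`input_partitions` is hypothetical, indices are 0-based (so T1 is target 0, 
L1 is left 0, R1 is right 0), and it assumes targets are numbered row-major 
over (left, right) as in the example.

```rust
/// Hypothetical helper: maps a target partition index to the
/// (left, right) input partition pair it would combine.
/// Assumes row-major numbering, i.e. all right partitions for left 0
/// come first, then all right partitions for left 1, and so on.
fn input_partitions(
    target: usize,
    left_count: usize,
    right_count: usize,
) -> Option<(usize, usize)> {
    // Reject empty inputs and out-of-range target indices.
    if right_count == 0 || target >= left_count * right_count {
        return None;
    }
    Some((target / right_count, target % right_count))
}
```

   With 2 left and 3 right partitions, target 3 (T4) maps to left 1 and 
right 0, which is the L2-R1 pair listed above.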
   
   I don't think we need multiple scans of the same source. The first 
consumer can scan a partition and share the data with the others. That said, 
I am not sure this approach would bring significant performance improvements, 
but I saw the TODO item and thought it would be interesting to discuss with you.
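
   One way to picture "scan once and share" is a per-partition cell that the 
first caller fills and later callers reuse. The sketch below is a synchronous, 
standard-library-only stand-in; the type `SharedPartition` and its methods are 
hypothetical and are not part of DataFusion, which works with async record 
batch streams. It also assumes the scanned partition is buffered in memory so 
that it can be shared.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical holder for one input partition that is scanned at most once.
struct SharedPartition<T> {
    // Filled by whichever consumer asks first.
    cell: OnceLock<Arc<Vec<T>>>,
    // Counts how many times the scan closure actually ran.
    scans: AtomicUsize,
}

impl<T> SharedPartition<T> {
    fn new() -> Self {
        Self {
            cell: OnceLock::new(),
            scans: AtomicUsize::new(0),
        }
    }

    /// Returns the buffered rows, running `scan` only if no one has yet.
    fn get_or_scan(&self, scan: impl FnOnce() -> Vec<T>) -> Arc<Vec<T>> {
        Arc::clone(self.cell.get_or_init(|| {
            self.scans.fetch_add(1, Ordering::Relaxed);
            Arc::new(scan())
        }))
    }

    /// Number of scans performed so far (0 or 1).
    fn scan_count(&self) -> usize {
        self.scans.load(Ordering::Relaxed)
    }
}
```

   In the example above, T1, T2 and T3 would all call `get_or_scan` on the 
holder for L1; only the first call runs the scan, and the other two receive 
the same `Arc`. The cost is that the shared partition stays in memory for as 
long as any consumer holds it.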
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

