Re: [PR] CrossJoin Refactor [arrow-datafusion]

via GitHub Mon, 01 Apr 2024 01:28:02 -0700


berkaysynnada commented on PR #9830:
URL: 
https://github.com/apache/arrow-datafusion/pull/9830#issuecomment-2029397635


   > And one more thing to consider (and this is the second concern) -- if 
LeftData will be relatively small (it's natural behaviour, due to the fact that 
CrossJoin supports input reordering basing on statistics), switching from "left 
row + right batch" to "right row + left batch" might potentially produce a 
massive amount of small batches -- to avoid it we might need to either wrap 
CrossJoin into CoalesceBatches (via this 
[rule](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_optimizer/coalesce_batches.rs)),
 or to add some built-in buffering.
   > 
   > In current implementation small left side seems to be +- safe as we assume 
that right side is either large enough to provide full-size batches, and it's 
OK to produce output batch per left-side row, or, if right-side is small, due 
to reordering, left-side should be even smaller, so it won't make much harm.
   
   I can add a `CoalesceBatchesExec` into the left child, but it requires more 
analysis such that what is the batch size of left child, is it already a 
`CoalesceBatchesExec`, if it is so how would they be merged etc. What I observe 
is that the rule adds `CoalesceBatchesExec` above the plans which 
reduces/filters the number of rows. CrossJoin does not do such a thing. I think 
all streams are written assuming they receive the correct number of batch size.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] CrossJoin Refactor [arrow-datafusion]

Reply via email to