berkaysynnada commented on PR #9830: URL: https://github.com/apache/arrow-datafusion/pull/9830#issuecomment-2029397635
> And one more thing to consider (and this is the second concern) -- if LeftData will be relatively small (it's natural behaviour, due to the fact that CrossJoin supports input reordering basing on statistics), switching from "left row + right batch" to "right row + left batch" might potentially produce a massive amount of small batches -- to avoid it we might need to either wrap CrossJoin into CoalesceBatches (via this [rule](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_optimizer/coalesce_batches.rs)), or to add some built-in buffering. > > In current implementation small left side seems to be +- safe as we assume that right side is either large enough to provide full-size batches, and it's OK to produce output batch per left-side row, or, if right-side is small, due to reordering, left-side should be even smaller, so it won't make much harm. I can add a `CoalesceBatchesExec` into the left child, but it requires more analysis such that what is the batch size of left child, is it already a `CoalesceBatchesExec`, if it is so how would they be merged etc. What I observe is that the rule adds `CoalesceBatchesExec` above the plans which reduces/filters the number of rows. CrossJoin does not do such a thing. I think all streams are written assuming they receive the correct number of batch size. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
