Re: [PR] CrossJoin Refactor [arrow-datafusion]

via GitHub Tue, 02 Apr 2024 06:53:14 -0700


berkaysynnada commented on PR #9830:
URL: 
https://github.com/apache/arrow-datafusion/pull/9830#issuecomment-2032103953


   > > I can add a `CoalesceBatchesExec` into the left child, but it requires 
more analysis such that what is the batch size of left child, is it already a 
`CoalesceBatchesExec`, if it is so how would they be merged etc. What I observe 
is that the rule adds `CoalesceBatchesExec` above the plans which 
reduces/filters the number of rows. CrossJoin does not do such a thing. I think 
all streams are written assuming they receive the correct number of batch size.
   > 
   > Yes, I was pointing out that output of CrossJoin might require to be 
coalesced (even if both inputs are fine in terms of batch-sizes). Here is an 
example: for query
   > 
   
   Thank you for the detailed review and benchmark results. Yes, you are right. 
In those cases (where the left batch sizes are smaller than the target batch 
size), this strategy shows a drastic regression. I tried concatenating the 
built batch results until the target batch size is reached, but performance is 
still poor (approximately 20x slower). I think I should revert the changes and 
just remove the lock and the `ScalarValue` conversions.
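
   For readers unfamiliar with the idea, the "concatenate until the target 
batch size is reached" approach can be sketched roughly as below. This is only 
an illustration under stated assumptions, not the code from this PR and not 
DataFusion's API: `Batch` is a plain `Vec<i64>` standing in for a record batch, 
and `Coalescer` and `coalesce_all` are hypothetical names.

```rust
/// Hypothetical stand-in for a record batch: just a vector of row values.
type Batch = Vec<i64>;

/// Buffers small batches and emits one only once `target` rows have accumulated.
/// (Illustrative only; not a DataFusion type.)
struct Coalescer {
    target: usize,
    buffer: Batch,
}

impl Coalescer {
    fn new(target: usize) -> Self {
        Self { target, buffer: Vec::new() }
    }

    /// Appends `batch` to the buffer. Returns the buffered rows as one batch
    /// once the buffer holds at least `target` rows, otherwise `None`.
    fn push(&mut self, batch: Batch) -> Option<Batch> {
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target {
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }

    /// Flushes any remaining rows at the end of the input.
    fn finish(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}

/// Runs a whole sequence of input batches through the coalescer.
fn coalesce_all(inputs: Vec<Batch>, target: usize) -> Vec<Batch> {
    let mut coalescer = Coalescer::new(target);
    let mut out = Vec::new();
    for batch in inputs {
        if let Some(full) = coalescer.push(batch) {
            out.push(full);
        }
    }
    if let Some(rest) = coalescer.finish() {
        out.push(rest);
    }
    out
}
```

   In this sketch every row is copied once into the buffer before it is 
emitted, and an emitted batch may exceed `target` rows because the buffer is 
flushed whole rather than sliced.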


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

