korowa commented on PR #9830:
URL: 
https://github.com/apache/arrow-datafusion/pull/9830#issuecomment-2030022146

   > I can add a `CoalesceBatchesExec` into the left child, but it requires 
more analysis such that what is the batch size of left child, is it already a 
`CoalesceBatchesExec`, if it is so how would they be merged etc. What I observe 
is that the rule adds `CoalesceBatchesExec` above the plans which 
reduces/filters the number of rows. CrossJoin does not do such a thing. I think 
all streams are written assuming they receive the correct number of batch size.
   
   Yes, I was pointing out that output of CrossJoin might require to be 
coalesced (even if both inputs are fine in terms of batch-sizes). Here is an 
example: for query
   ```
   EXPLAIN ANALYZE SELECT * FROM nation JOIN lineitem ON TRUE
   ```
   (both tables are from tpch-sf1), I've obtained following metrics for Join:
   ```
   -- current version from main
   CrossJoinExec, metrics=[output_rows=150030375, build_input_rows=25, 
build_input_batches=1, input_batches=733, input_rows=150030375, 
output_batches=18325, build_mem_used=3376, build_time=5.742028ms, 
join_time=14.925451535s]
   
   -- version from this PR
   CrossJoinExec, metrics=[output_rows=150030375, build_input_batches=1, 
input_rows=6001215, output_batches=6001215, input_batches=733, 
build_input_rows=25, build_mem_used=3376, join_time=433.554947232s, 
build_time=1.836068ms]
   ```
   In case build-side is relatively small in comparison with target_batch_size, 
cross-join produces massive amount of "undersaturated" batches.
   
   + these are metrics collected while running in debug mode, but still -- they 
show significant slowdown for such inputs case (may be explained by x300 more 
output batches, and work related to their construction) -- if such slowdown 
retains in release mode, I suppose, coalescing batches won't be a good option, 
and it's better to figure out how to produce proper batches right in join 
execution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to