korowa commented on PR #9830: URL: https://github.com/apache/arrow-datafusion/pull/9830#issuecomment-2030022146
> I can add a `CoalesceBatchesExec` into the left child, but it requires more analysis such that what is the batch size of left child, is it already a `CoalesceBatchesExec`, if it is so how would they be merged etc. What I observe is that the rule adds `CoalesceBatchesExec` above the plans which reduces/filters the number of rows. CrossJoin does not do such a thing. I think all streams are written assuming they receive the correct number of batch size.

Yes, I was pointing out that the output of CrossJoin might need to be coalesced (even if both inputs are fine in terms of batch sizes).

Here is an example: for the query
```
EXPLAIN ANALYZE SELECT * FROM nation JOIN lineitem ON TRUE
```
(both tables are from tpch-sf1), I've obtained the following metrics for the join:
```
-- current version from main
CrossJoinExec, metrics=[output_rows=150030375, build_input_rows=25, build_input_batches=1, input_batches=733, input_rows=150030375, output_batches=18325, build_mem_used=3376, build_time=5.742028ms, join_time=14.925451535s]

-- version from this PR
CrossJoinExec, metrics=[output_rows=150030375, build_input_batches=1, input_rows=6001215, output_batches=6001215, input_batches=733, build_input_rows=25, build_mem_used=3376, join_time=433.554947232s, build_time=1.836068ms]
```

When the build side is relatively small compared with `target_batch_size`, the cross join produces a massive number of "undersaturated" batches.

Also, these metrics were collected while running in debug mode, but they still show a significant slowdown for such inputs (which may be explained by roughly 300x more output batches and the work needed to construct them). If the slowdown persists in release mode, I suppose coalescing batches won't be a good option, and it would be better to figure out how to produce properly sized batches right in the join execution.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
