Re: [PR] GH-45917: [C++][Acero] Join result materialize in parallel [arrow]

via GitHub Tue, 25 Mar 2025 03:08:07 -0700


zanmato1984 commented on PR #45918:
URL: https://github.com/apache/arrow/pull/45918#issuecomment-2750739362


   Thank you. After some math I think I can explain the perf boost in your 
setup.
   
   Your thread count is `120`, so there are `120` `materialize_` s. You 
modified the `kNumRowsPerScanTask` to `4k` - let's call each `4k` rows a 
"batch" for short. You have `512 * (1 << 15)` unmatched rows, aka `4096` 
batches in the build side. Because `4096` > thread count so the parallelism of 
the scan will be `120`. Each scan thread accumulates `4096 / 120 = 34` batches 
into a `materialize_` with most of the rows output to the downstream (the agg) 
in parallel and `32k` rows pending to flush in `materialize_`. Finally when the 
scan is finished, the serial flushing will need to flush `120 * 32k` rows aka 
`960` batches in sequence and output to downstream (the agg), whereas the 
parallel flushing (this PR) will use `120` thread, each flushing `32k` rows aka 
`4` batches.
   
   Summarizing the comparison:
   1. In the scan phase, each thread processes `34` batches, this is the same 
for both serial and parallel flushing.
   2. In the flushing phase, serial execution processes `960` batches but 
parallel execution processes `4` batches (in each thread).
   3. After roughly solving some equations, it takes about `0.5`s to process a 
batch (mostly by the downstream agg). The serial flushing takes about 
`(/*scan*/34 + /*flush*/960) * 0.5 = 497`s, and the parallel flushing takes 
about `(/*scan*/34 + /*flush*/4) * 0.5 = 19`s.
   4. The numbers all add up.
   
   In terms of the improvement of this PR for `kNumRowsPerScanTask = 512k` (the 
original value), though we don't have numbers, we can infer that due to the 
less parallelism (`512 * (1 << 15) / 512k = 32` threads), the scan phase won't 
be as fast. And although there are still `120` `materialize_` s (the probe 
parallelism is still `120`), only `32` of them will have data (because of the 
scan parallelism being `32`), so the flushing will only boost for `32`x using 
parallel flushing.
   
   But anyway, I think the effectiveness of this PR is independent of 
`kNumRowsPerScanTask` and fully justified. I will now proceed with reviewing 
the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-45917: [C++][Acero] Join result materialize in parallel [arrow]

Reply via email to