mingmwang commented on PR #5745:
URL:
https://github.com/apache/arrow-datafusion/pull/5745#issuecomment-1488280207
> My intuition says that 2 partial aggregates in **PARALLEL** branches
> would be more susceptible to parallelization than one big partial aggregate. 🧐
>
> Could you clarify, why that's not the case?
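This comes down to how UnionExec is actually executed: its output partition
count is the sum of its inputs' partition counts.
For example, in the plan fragment below (numbers in parentheses are partition
counts), if each CsvExec has a parallelism of 1, then the AggregateExec has a
parallelism of 2: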
```
AggregateExec(2)
  UnionExec(2)
    CsvExec(1)
    CsvExec(1)
```
If we push the AggregateExec down through the UnionExec as below, the total
parallelism of the AggregateExec operators is still 2.
This is what I mean when I say the pushdown is useless.
```
UnionExec(2)
  AggregateExec(1)
    CsvExec(1)
  AggregateExec(1)
    CsvExec(1)
```
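To make the counting concrete, here is a minimal sketch of the two plans above. It assumes that a union's output partition count is simply the sum of its inputs' counts and that an aggregate runs one task per input partition; `union_partitions` is a hypothetical helper for illustration, not a DataFusion API.

```rust
/// Toy model (assumption): a union exposes every input partition as its own
/// output partition, so its partition count is the sum over its inputs.
fn union_partitions(input_partitions: &[usize]) -> usize {
    input_partitions.iter().sum()
}

fn main() {
    // Two CsvExec inputs with one partition each, as in the plans above.
    let csv_partitions = [1usize, 1];

    // Aggregate above the union: it runs once per union output partition.
    let aggregate_above_union = union_partitions(&csv_partitions);

    // Aggregate pushed below the union: one aggregate per branch, each
    // running once per partition of its own CsvExec input.
    let aggregate_below_union: usize = csv_partitions.iter().copied().sum();

    // Both placements give the same total number of aggregate tasks.
    assert_eq!(aggregate_above_union, aggregate_below_union);
    println!(
        "above union: {}, below union: {}",
        aggregate_above_union, aggregate_below_union
    );
}
```

Under this model the two placements are always equal, whatever the input partition counts are, which is why the pushdown alone gains nothing.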
Now take RepartitionExec into consideration: if the RepartitionExec
changes the parallelism to 4, the parallelism of the AggregateExec is 4:
```
AggregateExec(4)
  RepartitionExec(4)
    UnionExec(2)
      CsvExec(1)
      CsvExec(1)
```
If we push the AggregateExec down through RepartitionExec + UnionExec, it
reduces the parallelism of the AggregateExec to 2 and brings no
performance benefit:
```
RepartitionExec(4)
  UnionExec(2)
    AggregateExec(1)
      CsvExec(1)
    AggregateExec(1)
      CsvExec(1)
```
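The same comparison can be sketched as a small tree model. This is only an illustration under stated assumptions: the `Plan` enum and its methods are hypothetical and are not DataFusion's `ExecutionPlan` API; a union is assumed to sum its inputs' partitions, a repartition to output a fixed count, and an aggregate to run once per input partition.

```rust
/// Hypothetical toy plan model for counting partitions.
enum Plan {
    Csv(usize),                    // scan with a given partition count
    Union(Vec<Plan>),              // output partitions = sum of children
    Repartition(usize, Box<Plan>), // output partitions = target count
    Aggregate(Box<Plan>),          // keeps its input's partition count
}

impl Plan {
    /// Number of output partitions of this node.
    fn partitions(&self) -> usize {
        match self {
            Plan::Csv(n) => *n,
            Plan::Union(children) => children.iter().map(Plan::partitions).sum(),
            Plan::Repartition(n, _) => *n,
            Plan::Aggregate(input) => input.partitions(),
        }
    }

    /// Total number of aggregate tasks (one per aggregate input partition).
    fn aggregate_parallelism(&self) -> usize {
        match self {
            Plan::Csv(_) => 0,
            Plan::Union(children) => {
                children.iter().map(Plan::aggregate_parallelism).sum()
            }
            Plan::Repartition(_, input) => input.aggregate_parallelism(),
            Plan::Aggregate(input) => {
                input.partitions() + input.aggregate_parallelism()
            }
        }
    }
}

/// AggregateExec above RepartitionExec(4) above the union of two scans.
fn aggregate_above_repartition() -> Plan {
    let union = Plan::Union(vec![Plan::Csv(1), Plan::Csv(1)]);
    Plan::Aggregate(Box::new(Plan::Repartition(4, Box::new(union))))
}

/// AggregateExec pushed down to each branch, below the union.
fn aggregate_pushed_down() -> Plan {
    let union = Plan::Union(vec![
        Plan::Aggregate(Box::new(Plan::Csv(1))),
        Plan::Aggregate(Box::new(Plan::Csv(1))),
    ]);
    Plan::Repartition(4, Box::new(union))
}

fn main() {
    let above = aggregate_above_repartition();
    let pushed = aggregate_pushed_down();
    // The pushed-down plan still outputs 4 partitions, but its aggregates
    // only run on the 2 scan partitions.
    assert_eq!(above.aggregate_parallelism(), 4);
    assert_eq!(pushed.aggregate_parallelism(), 2);
    assert_eq!(pushed.partitions(), 4);
    println!(
        "aggregate tasks: above = {}, pushed down = {}",
        above.aggregate_parallelism(),
        pushed.aggregate_parallelism()
    );
}
```

The model ignores data volume and per-row cost; it only counts how many aggregate tasks can run at once, which is the quantity being compared here.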
Hope this explains it.