mingmwang commented on PR #5745:
URL: 
https://github.com/apache/arrow-datafusion/pull/5745#issuecomment-1488280207

   > My intuitions says, that 2 partial aggregates in **PARALLEL** branches 
would be more susceptible to parallelization - than one big partial aggregate. 🧐
   > 
   > Could you clarify, why that's not the case?
   
   This is related to how UnionExec is actually executed:
   For example, if we have a plan fragment like below, if each CsvExec's 
parallelization is 1, then the AggregateExec's parallelization is 2
   
   ```
   AggregateExec(2)
       UnionExec(2)
          CsvExec(1)
          CsvExec(1)
   ```
   
   If we push down the  AggregateExec through UnionExec like below,  the total 
parallelization of AggregateExec is still 2.
   This is what I mean the pushdown is useless.
   
   ```
    UnionExec(2)
       AggregateExec(1)
          CsvExec(1)
      AggregateExec(1)
          CsvExec(1)
   ```
   
   Now we take the RepartitionExec into consideration, if the RepartitionExec 
change the parallelization to 4, the parallelization of AggregateExec is 4
   
   
   ```
   AggregateExec(4)
      RepartitionExec(4)
          UnionExec(2)
            CsvExec(1)
            CsvExec(1)
   ```
   If we push down the AggregateExec  through RepartitionExec + UnionExec ,  it 
will reduce the parallelization of AggregateExec to 2 and will not get 
performance benefits
   
   ```
    RepartitionExec(4)
        UnionExec(2)
           AggregateExec(1)
              CsvExec(1)
           AggregateExec(1)
              CsvExec(1)
   ```
   
   Hope this explains.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to