[GitHub] [arrow-datafusion] Dandandan opened a new issue, #6892: Introduce `Partitioned` aggregation mode

via GitHub Sat, 08 Jul 2023 14:38:48 -0700


Dandandan opened a new issue, #6892:
URL: https://github.com/apache/arrow-datafusion/issues/6892


   ### Is your feature request related to a problem or challenge?
   
   Currently, aggregations often run in two modes: `Partial` + 
`FinalPartitioned`. `FinalPartitioned` requires its input to be 
hash-repartitioned, so a `RepartitionExec` is added in between.
   
   In certain cases, such as when aggregating on only a single column, it is 
faster to skip the `Partial` aggregation and aggregate directly 
on hash-partitioned input, because `Partial` + `RepartitionExec` + 
`FinalPartitioned` does more work than aggregating in one step 
(`RepartitionExec` + `Partitioned`).
   
   Reasoning: `RepartitionExec` (hash) is itself faster than `Partial` (although it 
doesn't reduce the output) and is necessary for `FinalPartitioned` in any case.
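
   To make the comparison concrete, here is a minimal Python sketch of the two plans for a `SUM` grouped by one key. It is not DataFusion code: the function names (`hash_repartition`, `aggregate_sum`, `two_phase`, `single_phase`) are hypothetical, and partitions are modelled as plain lists of `(key, value)` rows. Both plans give the same result, but the one-step plan runs a single aggregation pass instead of two.

```python
from collections import defaultdict


def hash_repartition(partitions, num_partitions):
    """Stand-in for a hash repartition: route each row by hash(key)."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_partitions].append((key, value))
    return out


def aggregate_sum(rows):
    """Group rows by key and sum their values."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)


def two_phase(partitions, num_partitions):
    """Partial -> hash repartition -> FinalPartitioned."""
    # Partial: aggregate each input partition independently.
    partial = [list(aggregate_sum(p).items()) for p in partitions]
    # Repartition the partial states so each key lands in one partition.
    shuffled = hash_repartition(partial, num_partitions)
    # Final: merge partial sums; keys are disjoint across partitions.
    result = {}
    for part in shuffled:
        result.update(aggregate_sum(part))
    return result


def single_phase(partitions, num_partitions):
    """Hash repartition -> one aggregation per partition (the proposed mode)."""
    shuffled = hash_repartition(partitions, num_partitions)
    result = {}
    for part in shuffled:
        result.update(aggregate_sum(part))
    return result


rows = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
assert two_phase(rows, 4) == single_phase(rows, 4)
```

   When the group keys are mostly unique, the `Partial` step barely shrinks its input, so the first aggregation pass in `two_phase` is close to pure overhead.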
   
   ### Describe the solution you'd like
   
   1. Introduce a `Partitioned` aggregation mode that requires its input to be 
partitioned on the group-by keys.
   2. Use it in certain cases, such as aggregations over only one or a few 
columns (consider queries such as `SELECT COUNT(DISTINCT a) FROM T`).
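
   As an illustration of why `COUNT(DISTINCT a)` suits this mode, the sketch below (plain Python, not DataFusion code; `count_distinct_partitioned` is a hypothetical name) hash-partitions the values of `a` and counts distinct values per partition. Because each value of `a` lands in exactly one partition, the per-partition counts can simply be added, with no partial pre-aggregation step.

```python
def count_distinct_partitioned(partitions, num_partitions):
    """Count distinct values by hash-partitioning on the value itself."""
    # One set per output partition; routing by hash(a) guarantees that
    # equal values always end up in the same set.
    buckets = [set() for _ in range(num_partitions)]
    for part in partitions:
        for a in part:
            buckets[hash(a) % num_partitions].add(a)
    # Partitions are disjoint on `a`, so the distinct counts add up.
    return sum(len(bucket) for bucket in buckets)


column_a = [[1, 2, 2, 3], [3, 4, 1]]
assert count_distinct_partitioned(column_a, 3) == len({1, 2, 3, 4})
```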
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


