Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

via GitHub Sat, 29 Mar 2025 03:57:42 -0700


goldmedal commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2763293892


   @Dandandan 
   I have a draft https://github.com/goldmedal/datafusion/pull/3 based on 
#15423 for `HashAggregate`. Could you check if it's heading in the right 
direction?  
   
   When the selection vector mode is enabled:  
   - `CoalesceBatchesExec` is not added for `FinalPartitioned`.  
   - The selection vector is used to filter the required rows before merging 
batches.  
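
To illustrate the idea in the two points above, here is a minimal sketch using only the Rust standard library. It is not the code from the draft PR: `partition_of`, `build_selection_vectors`, and `take_selected` are hypothetical names, a plain `&[i32]` stands in for an Arrow batch, and `DefaultHasher` stands in for whatever hashing DataFusion actually uses. The repartition step only records which row indices belong to each partition, and the consumer copies just the selected rows when it needs them.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map a group key to an output partition by hashing it.
fn partition_of(key: i32, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Build one selection vector (a list of row indices) per output partition.
/// The input batch is only read here; no rows are copied.
fn build_selection_vectors(batch: &[i32], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut selections: Vec<Vec<usize>> = vec![Vec::new(); num_partitions];
    for (row, key) in batch.iter().enumerate() {
        selections[partition_of(*key, num_partitions)].push(row);
    }
    selections
}

/// Consumer side: materialize only the rows named by the selection vector,
/// for example right before merging them into the aggregation state.
fn take_selected(batch: &[i32], selection: &[usize]) -> Vec<i32> {
    selection.iter().map(|&row| batch[row]).collect()
}

fn main() {
    // Same values as the example table `t` in this comment.
    let batch = vec![1, 1, 1, 1, 2, 2, 3, 3];
    let selections = build_selection_vectors(&batch, 4);
    for (partition, selection) in selections.iter().enumerate() {
        println!(
            "partition {}: rows {:?} -> values {:?}",
            partition,
            selection,
            take_selected(&batch, selection)
        );
    }
}
```

Every partition can share the same underlying batch, so the copy is deferred until the rows are consumed.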
   
   The plan looks like this:
   ```
   > create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3)
   > explain select count(distinct c) from t;
   +---------------+--------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                             |
   +---------------+--------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: count(alias1) AS count(DISTINCT t.c)                                                 |
   |               |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]                                                |
   |               |     Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]                                              |
   |               |       TableScan: t projection=[c]                                                                |
   | physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.c)]                                    |
   |               |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]                                        |
   |               |     CoalescePartitionsExec                                                                       |
   |               |       AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]                                  |
   |               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                  |
   |               |           RepartitionExec: partitioning=HashSelectionVector([alias1@0], 12), input_partitions=12 |
   |               |             RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                |
   |               |               AggregateExec: mode=Partial, gby=[c@0 as alias1], aggr=[]                          |
   |               |                 DataSourceExec: partitions=1, partition_sizes=[1]                                |
   |               |                                                                                                  |
   +---------------+--------------------------------------------------------------------------------------------------+
   ```
   
   I'll review more aggregation patterns and add additional tests.
   Thanks.


