[PR] feat: optimize and unparse grouping [datafusion]

via GitHub Fri, 23 May 2025 00:57:59 -0700


chenkovsky opened a new pull request, #16161:
URL: https://github.com/apache/datafusion/pull/16161


   ## Which issue does this PR close?
   
   
   ## Rationale for this change
   
   first, it seems that grouping udaf document is not correct.
   
   and for aggregation with grouping,
   
   e.g.
   ```sql
   CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values
   ('a','A',1), ('b','B',2);
   
   EXPLAIN FORMAT INDENT select
     c1,
     c2,
     CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx,
     grouping(c1) as g0,
     grouping(c2) as g1,
     grouping(c1, c2) as g2,
     grouping(c2, c1) as g3
   from
     test
   group by
     grouping sets (
       (c1, c2),
       (c1),
       (c2),
       ()
     );
   ```
   
   current logical plan is:
   ```
   | logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) 
= Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS 
g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, 
grouping(test.c2,test.c1) AS g3                                                 
                                                             |
   |               |   Projection: test.c1, test.c2, CAST(__common_expr_1 AS 
Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS 
grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), 
CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS 
grouping(test.c2,test.c1)                                       |
   |               |     Projection: __grouping_id & UInt8(2) >> UInt8(1) AS 
__common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, 
__grouping_id, sum(test.c3)                                                     
                                                                                
                                           |
   |               |       Aggregate: groupBy=[[GROUPING SETS ((test.c1, 
test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]     
                                                                                
                                                                                
                                               |
   |               |         TableScan: test projection=[c1, c2, c3]  
   ```
   
   the problems are:
   
   1. there are three projections. for bitwise operation, there's no benifit 
for extra projection.
   2. it makes grouping level optimization very hard. for example,
   
       ```sql
        select  case when grouping(c1) = 1 then sum(c2) else null end from test 
group by grouping sets (...)
       ```
      we only need to calculate sum(c2) in when grouping(c1) = 1, this is a 
very useful optimization trick. I'm also using this trick to optimizing sql 
with multiple count distinct.
       
   3. unparsing is not supported. Internal("Tried to unproject column referring 
to internal grouping id") will be thrown.
   
   
   ## What changes are included in this PR?
   
   1. create a grouping udf.
   2. modify analyzer, replace grouping_udaf with grouping udf
   3. change it back, when unparsing.
   
   now, the logical plan is:
   
   ```
   | logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) 
= Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS 
g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, 
grouping(test.c2,test.c1) AS g3                                                 
                            |
   |               |   Projection: test.c1, test.c2, grouping(__grouping_id, 
List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, 
List([0])) AS grouping(test.c2), grouping(__grouping_id) AS 
grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS 
grouping(test.c2,test.c1)                          |
   |               |     Aggregate: groupBy=[[GROUPING SETS ((test.c1, 
test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]     
                                                                                
                                                                                
                |
   |               |       TableScan: test projection=[c1, c2, c3]              
                                                                                
                                                                               
   
   ```
   
   there are only two projections. and unparsing is supported.
   
   ## Are these changes tested?
   
   UT
   
   ## Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] feat: optimize and unparse grouping [datafusion]

Reply via email to