chenkovsky opened a new pull request, #16161: URL: https://github.com/apache/datafusion/pull/16161
## Which issue does this PR close? ## Rationale for this change first, it seems that grouping udaf document is not correct. and for aggregation with grouping, e.g. ```sql CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values ('a','A',1), ('b','B',2); EXPLAIN FORMAT INDENT select c1, c2, CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx, grouping(c1) as g0, grouping(c2) as g1, grouping(c1, c2) as g2, grouping(c2, c1) as g3 from test group by grouping sets ( (c1, c2), (c1), (c2), () ); ``` current logical plan is: ``` | logical_plan | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3 | | | Projection: test.c1, test.c2, CAST(__common_expr_1 AS Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS grouping(test.c2,test.c1) | | | Projection: __grouping_id & UInt8(2) >> UInt8(1) AS __common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, __grouping_id, sum(test.c3) | | | Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]] | | | TableScan: test projection=[c1, c2, c3] ``` the problems are: 1. there are three projections. for bitwise operation, there's no benifit for extra projection. 2. it makes grouping level optimization very hard. for example, ```sql select case when grouping(c1) = 1 then sum(c2) else null end from test group by grouping sets (...) ``` we only need to calculate sum(c2) in when grouping(c1) = 1, this is a very useful optimization trick. I'm also using this trick to optimizing sql with multiple count distinct. 3. unparsing is not supported. Internal("Tried to unproject column referring to internal grouping id") will be thrown. ## What changes are included in this PR? 1. create a grouping udf. 2. modify analyzer, replace grouping_udaf with grouping udf 3. change it back, when unparsing. now, the logical plan is: ``` | logical_plan | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3 | | | Projection: test.c1, test.c2, grouping(__grouping_id, List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, List([0])) AS grouping(test.c2), grouping(__grouping_id) AS grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS grouping(test.c2,test.c1) | | | Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]] | | | TableScan: test projection=[c1, c2, c3] ``` there are only two projections. and unparsing is supported. ## Are these changes tested? UT ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org