sum aggregate in `single_distinct_to_group_by` rule [arrow-datafusion]

via GitHub Fri, 10 Nov 2023 06:30:37 -0800


haohuaijin opened a new issue, #8123:
URL: https://github.com/apache/arrow-datafusion/issues/8123


   ### Is your feature request related to a problem or challenge?
   
   In current `single_distinct_to_group_by` rule, we only support all aggregate 
function are distinct aggregate.
   But if the no distinct aggregate is count/min/max/sum, we can also do same 
transform like in `single_distinct_to_group_by`
   before
   ```sql
   select a, count(distinct b), count(c)
   from t
   group by a
   ```
   after
   ```sql
   select a, count(alias1), sum(alias2)
   from (
     select a, b as alias1, count(c) as alias2
     from t
     group by a, b
   )
   group by a
   ```
   
   ### Describe the solution you'd like
   
   By write, we can improve the perfmance of distinct aggregate
   ```
   ❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM 
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" LIMIT 
10;
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) | 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | 0        | 0                                                | 8            
                                          |
   | 1        | 147946                                           | 239380       
                                          |
   | 2        | 441662                                           | 1081016      
                                          |
   | 3        | 39724                                            | 131195       
                                          |
   | 4        | 34557                                            | 79500        
                                          |
   | 5        | 13502                                            | 40914        
                                          |
   | 6        | 24338                                            | 55768        
                                          |
   | 7        | 28417                                            | 64989        
                                          |
   | 8        | 34483                                            | 65472        
                                          |
   | 9        | 38047                                            | 91576        
                                          |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   10 rows in set. Query took 1.357 seconds.
   
   ❯ SELECT "RegionID", SUM(t1),  count("UserID") from (select "UserID", 
"RegionID", sum("AdvEngineID") as t1 from '../benchmarks/data/hits.parquet' 
group by "UserID", "RegionID") group by "RegionID" order by "RegionID" limit 10;
   +----------+---------+-----------------------------------------------+
   | RegionID | SUM(t1) | COUNT(../benchmarks/data/hits.parquet.UserID) |
   +----------+---------+-----------------------------------------------+
   | 0        | 0       | 8                                             |
   | 1        | 147946  | 239380                                        |
   | 2        | 441662  | 1081016                                       |
   | 3        | 39724   | 131195                                        |
   | 4        | 34557   | 79500                                         |
   | 5        | 13502   | 40914                                         |
   | 6        | 24338   | 55768                                         |
   | 7        | 28417   | 64989                                         |
   | 8        | 34483   | 65472                                         |
   | 9        | 38047   | 91576                                         |
   +----------+---------+-----------------------------------------------+
   10 rows in set. Query took 0.919 seconds.
   ```
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   
https://www.querifylabs.com/blog/distinct-aggregation-optimization-in-apache-calcite-and-trino
   
https://github.com/apache/calcite/blob/96b05ee12f936ed057265072ff6a2de8ea0a249e/core/src/main/java/org/apache/calcite/rel/rules/AggregateExpandDistinctAggregatesRule.java#L286-L298


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Support no distinct count/max/min/sum aggregate in `single_distinct_to_group_by` rule [arrow-datafusion]

Reply via email to