haohuaijin opened a new issue, #8123:
URL: https://github.com/apache/arrow-datafusion/issues/8123
### Is your feature request related to a problem or challenge?
In current `single_distinct_to_group_by` rule, we only support all aggregate
function are distinct aggregate.
But if the no distinct aggregate is count/min/max/sum, we can also do same
transform like in `single_distinct_to_group_by`
before
```sql
select a, count(distinct b), count(c)
from t
group by a
```
after
```sql
select a, count(alias1), sum(alias2)
from (
select a, b as alias1, count(c) as alias2
from t
group by a, b
)
group by a
```
### Describe the solution you'd like
By write, we can improve the perfmance of distinct aggregate
```
❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" LIMIT
10;
+----------+--------------------------------------------------+--------------------------------------------------------+
| RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) |
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
+----------+--------------------------------------------------+--------------------------------------------------------+
| 0 | 0 | 8
|
| 1 | 147946 | 239380
|
| 2 | 441662 | 1081016
|
| 3 | 39724 | 131195
|
| 4 | 34557 | 79500
|
| 5 | 13502 | 40914
|
| 6 | 24338 | 55768
|
| 7 | 28417 | 64989
|
| 8 | 34483 | 65472
|
| 9 | 38047 | 91576
|
+----------+--------------------------------------------------+--------------------------------------------------------+
10 rows in set. Query took 1.357 seconds.
❯ SELECT "RegionID", SUM(t1), count("UserID") from (select "UserID",
"RegionID", sum("AdvEngineID") as t1 from '../benchmarks/data/hits.parquet'
group by "UserID", "RegionID") group by "RegionID" order by "RegionID" limit 10;
+----------+---------+-----------------------------------------------+
| RegionID | SUM(t1) | COUNT(../benchmarks/data/hits.parquet.UserID) |
+----------+---------+-----------------------------------------------+
| 0 | 0 | 8 |
| 1 | 147946 | 239380 |
| 2 | 441662 | 1081016 |
| 3 | 39724 | 131195 |
| 4 | 34557 | 79500 |
| 5 | 13502 | 40914 |
| 6 | 24338 | 55768 |
| 7 | 28417 | 64989 |
| 8 | 34483 | 65472 |
| 9 | 38047 | 91576 |
+----------+---------+-----------------------------------------------+
10 rows in set. Query took 0.919 seconds.
```
### Describe alternatives you've considered
_No response_
### Additional context
https://www.querifylabs.com/blog/distinct-aggregation-optimization-in-apache-calcite-and-trino
https://github.com/apache/calcite/blob/96b05ee12f936ed057265072ff6a2de8ea0a249e/core/src/main/java/org/apache/calcite/rel/rules/AggregateExpandDistinctAggregatesRule.java#L286-L298
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]