ic4y opened a new issue #1282:
URL: https://github.com/apache/arrow-datafusion/issues/1282
In most query engines, the execution cost of the distinct aggregation
function is huge, but it can be optimized by groupBy. I want to bring this
optimization to datafushion
Currently for a single distinct aggregation scenario as follows
```
- Aggregation
GROUP BY (k)
F1(DISTINCT s0, s1, ...),
F2(DISTINCT s0, s1, ...),
- X
into
- Aggregation
GROUP BY (k)
F1(x)
F2(x)
- Aggregation
GROUP BY (k, s0, s1, ...)
- X
```
I used a test data set of 60 million to test datafunshion before and after
using the optimizer.After optimization,the performance has doubled and the
execution time has been reduced from 12 seconds to 6 seconds
The test results and the logical plan before and after optimization are as
follows
```
sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;
------------------original---------------------
Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE)
[COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT
#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT
lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+-------------------------------------------------+
| COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
+-------------------------------------------------+
| 1040570 |
+-------------------------------------------------+
usage millis: 12033
----------------after optimization-------------
Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE)
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]]
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]]
[LO_EXTENDEDPRICE:Int64]
TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+----------------------------------------+
| COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
+----------------------------------------+
| 1040570 |
+----------------------------------------+
usage millis: 5817
```
In the case of common aggregation functions and distinct aggregation
functions used together, optimization can also be done in a way similar to
GROUPING SET. Although it has not been tested on datafushion, I did the above
optimization in trino. In our production environment Has a very good performance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]