[GitHub] [arrow-datafusion] ic4y opened a new issue #1282: better performance improvement by optimizing a single distinct aggregation scene

GitBox Thu, 11 Nov 2021 01:04:16 -0800


ic4y opened a new issue #1282:
URL: https://github.com/apache/arrow-datafusion/issues/1282



   In most query engines, the execution cost of the distinct aggregation 
function is huge, but it can be optimized by groupBy. I want to bring this 
optimization to datafushion
   
   Currently for a single  distinct aggregation scenario as follows
   ```
   - Aggregation
          GROUP BY (k)
          F1(DISTINCT s0, s1, ...),
          F2(DISTINCT s0, s1, ...),
       - X
   into
   - Aggregation
            GROUP BY (k)
            F1(x)
            F2(x)
        - Aggregation
               GROUP BY (k, s0, s1, ...)
            - X
   ```
   
   I used a test data set of 60 million to test datafunshion before and after 
using the optimizer.After optimization,the performance has doubled and the 
execution time has been reduced from 12 seconds to 6 seconds
   The test results and the logical plan before and after optimization are as 
follows
   
   ```
   sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;
   
   ------------------original---------------------
   Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) 
[COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
     Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT 
#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT 
lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
       TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
   +-------------------------------------------------+
   | COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
   +-------------------------------------------------+
   | 1040570                                         |
   +-------------------------------------------------+
   usage millis: 12033
   
   ----------------after optimization-------------
   Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) 
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
     Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] 
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
       Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] 
[LO_EXTENDEDPRICE:Int64]
         TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
   +----------------------------------------+
   | COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
   +----------------------------------------+
   | 1040570                                |
   +----------------------------------------+
   usage millis: 5817
   
   ```
   
   In the case of common aggregation functions and distinct aggregation 
functions used together, optimization can also be done in a way similar to 
GROUPING SET. Although it has not been tested on datafushion, I did the above 
optimization in trino. In our production environment Has a very good performance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] ic4y opened a new issue #1282: better performance improvement by optimizing a single distinct aggregation scene

Reply via email to