ic4y opened a new pull request #1315:
URL: https://github.com/apache/arrow-datafusion/pull/1315


   # Which issue does this PR close?
   #1282,#1312
   
   # Rationale for this change
   Improve the performance of single_distinct_agg
   
   a single distinct aggregation optimization method as follows:
   
   ```
   - Aggregation
          GROUP BY (k)
          F1(DISTINCT s0, s1, ...),
          F2(DISTINCT s0, s1, ...),
       - X
   into
   - Aggregation
            GROUP BY (k)
            F1(x)
            F2(x)
        - Aggregation
               GROUP BY (k, s0, s1, ...)
            - X
   
   ```
   
   I used a test data set of 60 million to test datafunshion before and after 
using the optimizer. After optimization, the performance has double improvement 
and the execution time has been reduced from 12 seconds to 6 seconds.
   The test results and the logical plan before and after optimization are as 
follows:
   
   ```
   sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;
   
   ------------------original---------------------
   Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) 
[COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
     Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT 
#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT 
lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
       TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
   +-------------------------------------------------+
   | COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
   +-------------------------------------------------+
   | 1040570                                         |
   +-------------------------------------------------+
   usage millis: 12033
   
   ----------------after optimization-------------
   Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) 
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
     Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] 
[COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
       Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] 
[LO_EXTENDEDPRICE:Int64]
         TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
   +----------------------------------------+
   | COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
   +----------------------------------------+
   | 1040570                                |
   +----------------------------------------+
   usage millis: 5817
   
   ```
   
   
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   nothing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to