haohuaijin opened a new pull request, #8124:
URL: https://github.com/apache/arrow-datafusion/pull/8124

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   Closes #8123 
   
   ## Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in 
the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your 
changes and offer better suggestions for fixes.  
   -->
   In this pr
   ```
   ❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM 
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" limit 
10;
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) | 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | 0        | 0                                                | 8            
                                          |
   | 1        | 147946                                           | 239380       
                                          |
   | 2        | 441662                                           | 1081016      
                                          |
   | 3        | 39724                                            | 131195       
                                          |
   | 4        | 34557                                            | 79500        
                                          |
   | 5        | 13502                                            | 40914        
                                          |
   | 6        | 24338                                            | 55768        
                                          |
   | 7        | 28417                                            | 64989        
                                          |
   | 8        | 34483                                            | 65472        
                                          |
   | 9        | 38047                                            | 91576        
                                          |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   10 rows in set. Query took 0.945 seconds.
   ❯ explain SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") 
FROM '../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" 
limit 10;
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Limit: skip=0, fetch=10                                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               |
   |               |   Sort: ../benchmarks/data/hits.parquet.RegionID ASC NULLS 
LAST, fetch=10                                                                  
                                                                                
                                                                                
                                                                                
                                                                                
               |
   |               |     Aggregate: 
groupBy=[[../benchmarks/data/hits.parquet.RegionID]], aggr=[[SUM(alias2) AS 
SUM(../benchmarks/data/hits.parquet.AdvEngineID), COUNT(alias1) AS 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]]                        
                                                                                
                                                                                
                                                                            |
   |               |       Aggregate: 
groupBy=[[../benchmarks/data/hits.parquet.RegionID, 
../benchmarks/data/hits.parquet.UserID AS alias1]], 
aggr=[[SUM(CAST(../benchmarks/data/hits.parquet.AdvEngineID AS Int64)) AS 
alias2]]                                                                        
                                                                                
                                                                                
                                       |
   |               |         TableScan: ../benchmarks/data/hits.parquet 
projection=[RegionID, UserID, AdvEngineID]                                      
                                                                                
                                                                                
                                                                                
                                                                                
                       |
   | physical_plan | GlobalLimitExec: skip=0, fetch=10                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               |
   |               |   SortPreservingMergeExec: [RegionID@0 ASC NULLS LAST], 
fetch=10                                                                        
                                                                                
                                                                                
                                                                                
                                                                                
                  |
   |               |     SortExec: TopK(fetch=10), expr=[RegionID@0 ASC NULLS 
LAST]                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                 |
   |               |       AggregateExec: mode=FinalPartitioned, 
gby=[RegionID@0 as RegionID], 
aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID), COUNT(DISTINCT 
../benchmarks/data/hits.parquet.UserID)]                                        
                                                                                
                                                                                
                                                                                
         |
   |               |         CoalesceBatchesExec: target_batch_size=8192        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               |
   |               |           RepartitionExec: partitioning=Hash([RegionID@0], 
24), input_partitions=24                                                        
                                                                                
                                                                                
                                                                                
                                                                                
               |
   |               |             AggregateExec: mode=Partial, gby=[RegionID@0 
as RegionID], aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID), 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]                         
                                                                                
                                                                                
                                                                                
                           |
   |               |               AggregateExec: mode=FinalPartitioned, 
gby=[RegionID@0 as RegionID, alias1@1 as alias1], aggr=[alias2]                 
                                                                                
                                                                                
                                                                                
                                                                                
                      |
   |               |                 CoalesceBatchesExec: 
target_batch_size=8192                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                     |
   |               |                   RepartitionExec: 
partitioning=Hash([RegionID@0, alias1@1], 24), input_partitions=24              
                                                                                
                                                                                
                                                                                
                                                                                
                                       |
   |               |                     AggregateExec: mode=Partial, 
gby=[RegionID@0 as RegionID, UserID@1 as alias1], aggr=[alias2]                 
                                                                                
                                                                                
                                                                                
                                                                                
                         |
   |               |                       ParquetExec: file_groups={24 groups: 
[[home/hhj/datafusion/benchmarks/data/hits.parquet:0..615832352], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:615832352..1231664704], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:1231664704..1847497056], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:1847497056..2463329408], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:2463329408..3079161760], 
...]}, projection=[RegionID, UserID, AdvEngineID] |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
               |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.043 seconds.
   ```
   in main 43cc870a951611e9081a462d5a8a1686e87fce9a
   ```
   ❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM 
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" limit 
10;
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) | 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   | 0        | 0                                                | 8            
                                          |
   | 1        | 147946                                           | 239380       
                                          |
   | 2        | 441662                                           | 1081016      
                                          |
   | 3        | 39724                                            | 131195       
                                          |
   | 4        | 34557                                            | 79500        
                                          |
   | 5        | 13502                                            | 40914        
                                          |
   | 6        | 24338                                            | 55768        
                                          |
   | 7        | 28417                                            | 64989        
                                          |
   | 8        | 34483                                            | 65472        
                                          |
   | 9        | 38047                                            | 91576        
                                          |
   
+----------+--------------------------------------------------+--------------------------------------------------------+
   10 rows in set. Query took 1.343 seconds.
   ❯ explain SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") 
FROM '../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" 
limit 10;
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
       |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Limit: skip=0, fetch=10                                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
       |
   |               |   Sort: ../benchmarks/data/hits.parquet.RegionID ASC NULLS 
LAST, fetch=10                                                                  
                                                                                
                                                                                
                                                                                
                                                                                
       |
   |               |     Aggregate: 
groupBy=[[../benchmarks/data/hits.parquet.RegionID]], 
aggr=[[SUM(CAST(../benchmarks/data/hits.parquet.AdvEngineID AS Int64)), 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]]                        
                                                                                
                                                                                
                                                                                
     |
   |               |       TableScan: ../benchmarks/data/hits.parquet 
projection=[RegionID, UserID, AdvEngineID]                                      
                                                                                
                                                                                
                                                                                
                                                                                
                 |
   | physical_plan | GlobalLimitExec: skip=0, fetch=10                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
       |
   |               |   SortPreservingMergeExec: [RegionID@0 ASC NULLS LAST], 
fetch=10                                                                        
                                                                                
                                                                                
                                                                                
                                                                                
          |
   |               |     SortExec: TopK(fetch=10), expr=[RegionID@0 ASC NULLS 
LAST]                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
         |
   |               |       AggregateExec: mode=FinalPartitioned, 
gby=[RegionID@0 as RegionID], 
aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID), COUNT(DISTINCT 
../benchmarks/data/hits.parquet.UserID)]                                        
                                                                                
                                                                                
                                                                                
 |
   |               |         CoalesceBatchesExec: target_batch_size=8192        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
       |
   |               |           RepartitionExec: partitioning=Hash([RegionID@0], 
24), input_partitions=24                                                        
                                                                                
                                                                                
                                                                                
                                                                                
       |
   |               |             AggregateExec: mode=Partial, gby=[RegionID@0 
as RegionID], aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID), 
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]                         
                                                                                
                                                                                
                                                                                
                   |
   |               |               ParquetExec: file_groups={24 groups: 
[[home/hhj/datafusion/benchmarks/data/hits.parquet:0..615832352], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:615832352..1231664704], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:1231664704..1847497056], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:1847497056..2463329408], 
[home/hhj/datafusion/benchmarks/data/hits.parquet:2463329408..3079161760], 
...]}, projection=[RegionID, UserID, AdvEngineID] |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
       |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.054 seconds.
   ```
   
   ## What changes are included in this PR?
   add no-distinct count/sum/min/max aggregate support in 
`single_distinct_to_group_by` rule
   <!--
   There is no need to duplicate the description in the issue here but it is 
sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ## Are these changes tested?
   yes, add soem tests
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are 
they covered by existing tests)?
   -->
   
   ## Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to