haohuaijin commented on code in PR #8061:
URL: https://github.com/apache/arrow-datafusion/pull/8061#discussion_r1385972626
##########
datafusion/sqllogictest/test_files/tpch/q16.slt.part:
##########
@@ -69,11 +69,11 @@ physical_plan
GlobalLimitExec: skip=0, fetch=10
--SortPreservingMergeExec: [supplier_cnt@3 DESC,p_brand@0 ASC NULLS
LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST], fetch=10
----SortExec: TopK(fetch=10), expr=[supplier_cnt@3 DESC,p_brand@0 ASC NULLS
LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST]
-------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2
as p_size, COUNT(alias1)@3 as supplier_cnt]
---------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand,
p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2
as p_size, COUNT(DISTINCT partsupp.ps_suppkey)@3 as supplier_cnt]
+--------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand,
p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT
partsupp.ps_suppkey)]
----------CoalesceBatchesExec: target_batch_size=8192
------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1,
p_size@2], 4), input_partitions=4
---------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1
as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+--------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1
as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
Review Comment:
I directly ran the query, and the output column names are equal.
In this PR:
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| c | 5 |
| b | 5 |
| d | 5 |
| e | 5 |
| a | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.048 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+--------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+--------------------------------------------------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]],
aggr=[[COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)]] |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1,
/tmp/test/agg.csv.c2 AS alias1]], aggr=[[]] |
| | TableScan: /tmp/test/agg.csv projection=[c1, c2]
|
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1],
aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0], 24),
input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as c1],
aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as
c1, alias1@1 as alias1], aggr=[] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0,
alias1@1], 24), input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as
c1, c2@1 as alias1], aggr=[] |
| | RepartitionExec:
partitioning=RoundRobinBatch(24), input_partitions=1 |
| | CsvExec: file_groups={1 group:
[[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
| |
|
+---------------+--------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.030 seconds.
```
On main (3469c4e09a3d32381949dd0c0f626f406c00c6ad):
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| e | 5 |
| b | 5 |
| c | 5 |
| a | 5 |
| d | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.047 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: /tmp/test/agg.csv.c1, COUNT(alias1) AS
COUNT(DISTINCT /tmp/test/agg.csv.c2) |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]],
aggr=[[COUNT(alias1)]] |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1,
/tmp/test/agg.csv.c2 AS alias1]], aggr=[[]] |
| | TableScan: /tmp/test/agg.csv projection=[c1, c2]
|
| physical_plan | ProjectionExec: expr=[c1@0 as c1, COUNT(alias1)@1 as
COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1],
aggr=[COUNT(alias1)] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0], 24),
input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as c1],
aggr=[COUNT(alias1)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0
as c1, alias1@1 as alias1], aggr=[] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0,
alias1@1], 24), input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as
c1, c2@1 as alias1], aggr=[] |
| | RepartitionExec:
partitioning=RoundRobinBatch(24), input_partitions=1 |
| | CsvExec: file_groups={1 group:
[[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
| |
|
+---------------+----------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.027 seconds.
```
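Both plans above implement the same rewrite: `COUNT(DISTINCT c2) GROUP BY c1` becomes a first aggregate that groups on `(c1, c2 AS alias1)` to deduplicate pairs, followed by a second aggregate that counts `alias1` per `c1` group. A minimal sketch of that two-phase idea in plain Python (the data and names here are illustrative, not DataFusion APIs):

```python
from collections import defaultdict

# Illustrative input rows of (c1, c2).
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]

# Phase 1: Aggregate groupBy=[[c1, c2 AS alias1]], aggr=[[]]
# -- grouping on the pair deduplicates (c1, c2).
phase1 = set(rows)

# Phase 2: Aggregate groupBy=[[c1]], aggr=[[COUNT(alias1)]]
# -- count the surviving alias1 values per c1 group.
counts = defaultdict(int)
for c1, _alias1 in phase1:
    counts[c1] += 1

print(dict(counts))  # {'a': 2, 'b': 1}
```

The only difference between the two plans is whether the final column is named via a trailing `Projection` (main) or via an alias pushed into the aggregate itself (this PR); the computed result is identical.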
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]