haohuaijin commented on code in PR #8061:
URL: https://github.com/apache/arrow-datafusion/pull/8061#discussion_r1385972626
##########
datafusion/sqllogictest/test_files/tpch/q16.slt.part:
##########
@@ -69,11 +69,11 @@ physical_plan
GlobalLimitExec: skip=0, fetch=10
--SortPreservingMergeExec: [supplier_cnt@3 DESC,p_brand@0 ASC NULLS LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST], fetch=10
----SortExec: TopK(fetch=10), expr=[supplier_cnt@3 DESC,p_brand@0 ASC NULLS LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST]
-------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size, COUNT(alias1)@3 as supplier_cnt]
---------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size, COUNT(DISTINCT partsupp.ps_suppkey)@3 as supplier_cnt]
+--------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
----------CoalesceBatchesExec: target_batch_size=8192
------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2], 4), input_partitions=4
---------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+--------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
Review Comment:
Because the optimizer checks that the schema is consistent before and after optimization, the output column names never change. Below is my test from running the query directly.
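The invariant described above can be sketched roughly like this (a hypothetical simplification, not DataFusion code; `Plan`, `apply_rule`, and `single_distinct_rewrite` are made-up names):

```python
# Hypothetical sketch: an optimizer driver that rejects any rewrite
# which changes the plan's output schema (column names and types).
from dataclasses import dataclass

@dataclass
class Plan:
    # Output schema modeled as a list of (name, type) pairs.
    schema: list

def apply_rule(plan, rule):
    """Apply a rule, but keep the result only if the output schema
    is unchanged -- otherwise it is an internal error."""
    new_plan = rule(plan)
    if new_plan.schema != plan.schema:
        raise ValueError(
            f"rule changed schema: {plan.schema} -> {new_plan.schema}"
        )
    return new_plan

# A rewrite like single_distinct_to_groupby must therefore alias its
# internal COUNT(alias1) back to the original user-visible name, e.g.
# COUNT(alias1) AS "COUNT(DISTINCT /tmp/test/agg.csv.c2)".
def single_distinct_rewrite(plan):
    # The internal expressions change, but the exposed names do not.
    return Plan(schema=plan.schema)

p = Plan(schema=[("c1", "Utf8"), ("COUNT(DISTINCT c2)", "Int64")])
q = apply_rule(p, single_distinct_rewrite)
assert q.schema == p.schema
```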
In this PR:
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| c | 5 |
| b | 5 |
| d | 5 |
| e | 5 |
| a | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.048 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]], aggr=[[COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)]] |
|               |   Aggregate: groupBy=[[/tmp/test/agg.csv.c1, /tmp/test/agg.csv.c2 AS alias1]], aggr=[[]]                    |
|               |     TableScan: /tmp/test/agg.csv projection=[c1, c2]                                                        |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)]         |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                               |
|               |     RepartitionExec: partitioning=Hash([c1@0], 24), input_partitions=24                                     |
|               |       AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)]            |
|               |         AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, alias1@1 as alias1], aggr=[]                 |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                       |
|               |             RepartitionExec: partitioning=Hash([c1@0, alias1@1], 24), input_partitions=24                   |
|               |               AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as alias1], aggr=[]                        |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1                       |
|               |                   CsvExec: file_groups={1 group: [[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.030 seconds.
```
On main (3469c4e09a3d32381949dd0c0f626f406c00c6ad):
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| e | 5 |
| b | 5 |
| c | 5 |
| a | 5 |
| d | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.047 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+---------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: /tmp/test/agg.csv.c1, COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)                       |
|               |   Aggregate: groupBy=[[/tmp/test/agg.csv.c1]], aggr=[[COUNT(alias1)]]                                         |
|               |     Aggregate: groupBy=[[/tmp/test/agg.csv.c1, /tmp/test/agg.csv.c2 AS alias1]], aggr=[[]]                    |
|               |       TableScan: /tmp/test/agg.csv projection=[c1, c2]                                                        |
| physical_plan | ProjectionExec: expr=[c1@0 as c1, COUNT(alias1)@1 as COUNT(DISTINCT /tmp/test/agg.csv.c2)]                    |
|               |   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(alias1)]                                |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                               |
|               |       RepartitionExec: partitioning=Hash([c1@0], 24), input_partitions=24                                     |
|               |         AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[COUNT(alias1)]                                   |
|               |           AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, alias1@1 as alias1], aggr=[]                 |
|               |             CoalesceBatchesExec: target_batch_size=8192                                                       |
|               |               RepartitionExec: partitioning=Hash([c1@0, alias1@1], 24), input_partitions=24                   |
|               |                 AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as alias1], aggr=[]                        |
|               |                   RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1                       |
|               |                     CsvExec: file_groups={1 group: [[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
|               |                                                                                                               |
+---------------+---------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.027 seconds.
```
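Both versions expose the same output schema to the user; the difference is only where the user-visible name is restored. On main a separate ProjectionExec renames COUNT(alias1), while in this PR the alias is folded into the AggregateExec itself, saving a plan node. A rough way to picture it (hypothetical modeling, not DataFusion code):

```python
# Hypothetical sketch: both plan shapes must expose identical
# output column names, per the optimizer's schema check.

# On main, the inner aggregate yields a column named "COUNT(alias1)"
# and a projection on top renames it to the user-visible name.
main_inner = ["c1", "COUNT(alias1)"]
main_projection = {"c1": "c1", "COUNT(alias1)": "COUNT(DISTINCT c2)"}
main_output = [main_projection[col] for col in main_inner]

# In this PR, the aggregate expression carries the user-visible name
# directly, so no extra ProjectionExec node is needed.
pr_output = ["c1", "COUNT(DISTINCT c2)"]

assert main_output == pr_output  # same schema, one fewer plan node
```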
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]