haohuaijin commented on code in PR #8061:
URL: https://github.com/apache/arrow-datafusion/pull/8061#discussion_r1385972626
##########
datafusion/sqllogictest/test_files/tpch/q16.slt.part:
##########
@@ -69,11 +69,11 @@ physical_plan
GlobalLimitExec: skip=0, fetch=10
--SortPreservingMergeExec: [supplier_cnt@3 DESC,p_brand@0 ASC NULLS LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST], fetch=10
----SortExec: TopK(fetch=10), expr=[supplier_cnt@3 DESC,p_brand@0 ASC NULLS LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST]
-------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size, COUNT(alias1)@3 as supplier_cnt]
---------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size, COUNT(DISTINCT partsupp.ps_suppkey)@3 as supplier_cnt]
+--------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
----------CoalesceBatchesExec: target_batch_size=8192
------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2], 4), input_partitions=4
---------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+--------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
Review Comment:
Because the optimizer checks that the schema is consistent before and after optimization, the output column names never change. Below is my test from running the query directly.
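The invariant described above can be sketched roughly like this (a hypothetical simplification, not DataFusion code; `Plan`, `apply_rule`, and `single_distinct_rewrite` are made-up names):

```python
# Hypothetical sketch: an optimizer driver that rejects any rewrite
# which changes the plan's output schema (column names and types).
from dataclasses import dataclass

@dataclass
class Plan:
    # Output schema modeled as a list of (name, type) pairs.
    schema: list

def apply_rule(plan, rule):
    """Apply a rule, but keep the result only if the output schema
    is unchanged -- otherwise it is an internal error."""
    new_plan = rule(plan)
    if new_plan.schema != plan.schema:
        raise ValueError(
            f"rule changed schema: {plan.schema} -> {new_plan.schema}"
        )
    return new_plan

# A rewrite like single_distinct_to_groupby must therefore alias its
# internal COUNT(alias1) back to the original user-visible name, e.g.
# COUNT(alias1) AS "COUNT(DISTINCT /tmp/test/agg.csv.c2)".
def single_distinct_rewrite(plan):
    # The internal expressions change, but the exposed names do not.
    return Plan(schema=plan.schema)

p = Plan(schema=[("c1", "Utf8"), ("COUNT(DISTINCT c2)", "Int64")])
q = apply_rule(p, single_distinct_rewrite)
assert q.schema == p.schema
```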
In this PR:
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| c | 5 |
| b | 5 |
| d | 5 |
| e | 5 |
| a | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.048 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]], aggr=[[COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)]] |
|               |   Aggregate: groupBy=[[/tmp/test/agg.csv.c1, /tmp/test/agg.csv.c2 AS alias1]], aggr=[[]]                    |
|               |     TableScan: /tmp/test/agg.csv projection=[c1, c2]                                                        |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)]         |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                               |
|               |     RepartitionExec: partitioning=Hash([c1@0], 24), input_partitions=24                                     |
|               |       AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)]            |
|               |         AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, alias1@1 as alias1], aggr=[]                 |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                       |
|               |             RepartitionExec: partitioning=Hash([c1@0, alias1@1], 24), input_partitions=24                   |
|               |               AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as alias1], aggr=[]                        |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1                       |
|               |                   CsvExec: file_groups={1 group: [[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.030 seconds.
```
On main (3469c4e09a3d32381949dd0c0f626f406c00c6ad):
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| e | 5 |
| b | 5 |
| c | 5 |
| a | 5 |
| d | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.047 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+---------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                          |
+---------------+---------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: /tmp/test/agg.csv.c1, COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)                       |
|               |   Aggregate: groupBy=[[/tmp/test/agg.csv.c1]], aggr=[[COUNT(alias1)]]                                         |
|               |     Aggregate: groupBy=[[/tmp/test/agg.csv.c1, /tmp/test/agg.csv.c2 AS alias1]], aggr=[[]]                    |
|               |       TableScan: /tmp/test/agg.csv projection=[c1, c2]                                                        |
| physical_plan | ProjectionExec: expr=[c1@0 as c1, COUNT(alias1)@1 as COUNT(DISTINCT /tmp/test/agg.csv.c2)]                    |
|               |   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(alias1)]                                |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                               |
|               |       RepartitionExec: partitioning=Hash([c1@0], 24), input_partitions=24                                     |
|               |         AggregateExec: mode=Partial, gby=[c1@0 as c1], aggr=[COUNT(alias1)]                                   |
|               |           AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, alias1@1 as alias1], aggr=[]                 |
|               |             CoalesceBatchesExec: target_batch_size=8192                                                       |
|               |               RepartitionExec: partitioning=Hash([c1@0, alias1@1], 24), input_partitions=24                   |
|               |                 AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as alias1], aggr=[]                        |
|               |                   RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1                       |
|               |                     CsvExec: file_groups={1 group: [[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
|               |                                                                                                               |
+---------------+---------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.027 seconds.
```
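Both versions expose the same output schema to the user; the difference is only where the user-visible name is restored. On main a separate ProjectionExec renames COUNT(alias1), while in this PR the alias is folded into the AggregateExec itself, saving a plan node. A rough way to picture it (hypothetical modeling, not DataFusion code):

```python
# Hypothetical sketch: both plan shapes must expose identical
# output column names, per the optimizer's schema check.

# On main, the inner aggregate yields a column named "COUNT(alias1)"
# and a projection on top renames it to the user-visible name.
main_inner = ["c1", "COUNT(alias1)"]
main_projection = {"c1": "c1", "COUNT(alias1)": "COUNT(DISTINCT c2)"}
main_output = [main_projection[col] for col in main_inner]

# In this PR, the aggregate expression carries the user-visible name
# directly, so no extra ProjectionExec node is needed.
pr_output = ["c1", "COUNT(DISTINCT c2)"]

assert main_output == pr_output  # same schema, one fewer plan node
```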
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]