haohuaijin commented on code in PR #8061:
URL: https://github.com/apache/arrow-datafusion/pull/8061#discussion_r1385972626
##########
datafusion/sqllogictest/test_files/tpch/q16.slt.part:
##########
@@ -69,11 +69,11 @@ physical_plan
GlobalLimitExec: skip=0, fetch=10
--SortPreservingMergeExec: [supplier_cnt@3 DESC,p_brand@0 ASC NULLS
LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST], fetch=10
----SortExec: TopK(fetch=10), expr=[supplier_cnt@3 DESC,p_brand@0 ASC NULLS
LAST,p_type@1 ASC NULLS LAST,p_size@2 ASC NULLS LAST]
-------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2
as p_size, COUNT(alias1)@3 as supplier_cnt]
---------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand,
p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+------ProjectionExec: expr=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2
as p_size, COUNT(DISTINCT partsupp.ps_suppkey)@3 as supplier_cnt]
+--------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand,
p_type@1 as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT
partsupp.ps_suppkey)]
----------CoalesceBatchesExec: target_batch_size=8192
------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1,
p_size@2], 4), input_partitions=4
---------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1
as p_type, p_size@2 as p_size], aggr=[COUNT(alias1)]
+--------------AggregateExec: mode=Partial, gby=[p_brand@0 as p_brand, p_type@1
as p_type, p_size@2 as p_size], aggr=[COUNT(DISTINCT partsupp.ps_suppkey)]
Review Comment:
I directly ran the query, and the output column names are equal.
In this PR:
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| c | 5 |
| b | 5 |
| d | 5 |
| e | 5 |
| a | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.048 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+--------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+--------------------------------------------------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]],
aggr=[[COUNT(alias1) AS COUNT(DISTINCT /tmp/test/agg.csv.c2)]] |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1,
/tmp/test/agg.csv.c2 AS alias1]], aggr=[[]] |
| | TableScan: /tmp/test/agg.csv projection=[c1, c2]
|
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1],
aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0], 24),
input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as c1],
aggr=[COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as
c1, alias1@1 as alias1], aggr=[] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0,
alias1@1], 24), input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as
c1, c2@1 as alias1], aggr=[] |
| | RepartitionExec:
partitioning=RoundRobinBatch(24), input_partitions=1 |
| | CsvExec: file_groups={1 group:
[[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
| |
|
+---------------+--------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.030 seconds.
```
On main (3469c4e09a3d32381949dd0c0f626f406c00c6ad):
```
❯ select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+----+--------------------------------------+
| c1 | COUNT(DISTINCT /tmp/test/agg.csv.c2) |
+----+--------------------------------------+
| e | 5 |
| b | 5 |
| c | 5 |
| a | 5 |
| d | 5 |
+----+--------------------------------------+
5 rows in set. Query took 0.047 seconds.
❯ explain select c1, count(distinct c2) from '/tmp/test/agg.csv' group by c1;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: /tmp/test/agg.csv.c1, COUNT(alias1) AS
COUNT(DISTINCT /tmp/test/agg.csv.c2) |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1]],
aggr=[[COUNT(alias1)]] |
| | Aggregate: groupBy=[[/tmp/test/agg.csv.c1,
/tmp/test/agg.csv.c2 AS alias1]], aggr=[[]] |
| | TableScan: /tmp/test/agg.csv projection=[c1, c2]
|
| physical_plan | ProjectionExec: expr=[c1@0 as c1, COUNT(alias1)@1 as
COUNT(DISTINCT /tmp/test/agg.csv.c2)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1],
aggr=[COUNT(alias1)] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0], 24),
input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as c1],
aggr=[COUNT(alias1)] |
| | AggregateExec: mode=FinalPartitioned, gby=[c1@0
as c1, alias1@1 as alias1], aggr=[] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([c1@0,
alias1@1], 24), input_partitions=24 |
| | AggregateExec: mode=Partial, gby=[c1@0 as
c1, c2@1 as alias1], aggr=[] |
| | RepartitionExec:
partitioning=RoundRobinBatch(24), input_partitions=1 |
| | CsvExec: file_groups={1 group:
[[tmp/test/agg.csv]]}, projection=[c1, c2], has_header=true |
| |
|
+---------------+----------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.027 seconds.
```
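Both plans above implement the same rewrite: `COUNT(DISTINCT c2) GROUP BY c1` becomes a first aggregate that groups on `(c1, c2 AS alias1)` to deduplicate pairs, followed by a second aggregate that counts `alias1` per `c1` group. A minimal sketch of that two-phase idea in plain Python (the data and names here are illustrative, not DataFusion APIs):

```python
from collections import defaultdict

# Illustrative input rows of (c1, c2).
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]

# Phase 1: Aggregate groupBy=[[c1, c2 AS alias1]], aggr=[[]]
# -- grouping on the pair deduplicates (c1, c2).
phase1 = set(rows)

# Phase 2: Aggregate groupBy=[[c1]], aggr=[[COUNT(alias1)]]
# -- count the surviving alias1 values per c1 group.
counts = defaultdict(int)
for c1, _alias1 in phase1:
    counts[c1] += 1

print(dict(counts))  # {'a': 2, 'b': 1}
```

The only difference between the two plans is whether the final column is named via a trailing `Projection` (main) or via an alias pushed into the aggregate itself (this PR); the computed result is identical.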
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]