alamb commented on issue #5942: URL: https://github.com/apache/arrow-datafusion/issues/5942#issuecomment-1507271239
The explain plan clearly shows DataFusion trying to parallelize the scan:
```
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: lineitem.parquet.l_returnflag ASC NULLS LAST,
lineitem.parquet.l_linestatus ASC NULLS LAST
|
| | Projection: lineitem.parquet.l_returnflag,
lineitem.parquet.l_linestatus, SUM(lineitem.parquet.l_quantity) AS sum_qty,
SUM(lineitem.parquet.l_extendedprice) AS sum_base_price,
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount)
AS sum_disc_price, SUM(lineitem.parquet.l_extendedprice * Int64(1) -
lineitem.parquet.l_discount * Int64(1) + lineitem.parquet.l_tax) AS sum_charge,
AVG(lineitem.parquet.l_quantity) AS avg_qty,
AVG(lineitem.parquet.l_extendedprice) AS avg_price,
AVG(lineitem.parquet.l_discount) AS avg_disc, COUNT(UInt8(1)) AS count_order
|
| | Aggregate: groupBy=[[lineitem.parquet.l_returnflag,
lineitem.parquet.l_linestatus]], aggr=[[SUM(lineitem.parquet.l_quantity),
SUM(lineitem.parquet.l_extendedprice),
SUM(CAST(lineitem.parquet.l_extendedprice AS Decimal128(38, 4)) *
CAST(Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2)) AS Decimal128(38, 4))CAST(Decimal128(Some(100),23,2) -
CAST(lineitem.parquet.l_discount AS Decimal128(23, 2)) AS Decimal128(38,
4))Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2))CAST(lineitem.parquet.l_discount AS Decimal128(23,
2))lineitem.parquet.l_discountDecimal128(Some(100),23,2)CAST(lineitem.parquet.l_extendedprice
AS Decimal128(38, 4))lineitem.parquet.l_extendedprice AS
lineitem.parquet.l_extendedprice * Decimal128(Some(100),23,2) -
lineitem.parquet.l_discount) AS SUM(lineitem.parquet.l_extendedprice * Int64(1)
- lineitem.parquet.l_discount), SUM(CAST(CAST(lineitem.parquet.l_extendedprice
AS Decimal128(
38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount
AS Decimal128(23, 2)) AS Decimal128(38, 4))CAST(Decimal128(Some(100),23,2) -
CAST(lineitem.parquet.l_discount AS Decimal128(23, 2)) AS Decimal128(38,
4))Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2))CAST(lineitem.parquet.l_discount AS Decimal128(23,
2))lineitem.parquet.l_discountDecimal128(Some(100),23,2)CAST(lineitem.parquet.l_extendedprice
AS Decimal128(38, 4))lineitem.parquet.l_extendedprice AS
lineitem.parquet.l_extendedprice * Decimal128(Some(100),23,2) -
lineitem.parquet.l_discount AS Decimal128(38, 6)) *
CAST(Decimal128(Some(100),23,2) + CAST(lineitem.parquet.l_tax AS Decimal128(23,
2)) AS Decimal128(38, 6))) AS SUM(lineitem.parquet.l_extendedprice * Int64(1) -
lineitem.parquet.l_discount * Int64(1) + lineitem.parquet.l_tax),
AVG(lineitem.parquet.l_quantity), AVG(lineitem.parquet.l_extendedprice),
AVG(lineitem.parquet.l_discount), COUNT(UInt8(1))]] |
| | Projection: CAST(lineitem.parquet.l_extendedprice AS
Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) -
CAST(lineitem.parquet.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) AS
CAST(lineitem.parquet.l_extendedprice AS Decimal128(38, 4)) *
CAST(Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2)) AS Decimal128(38, 4))CAST(Decimal128(Some(100),23,2) -
CAST(lineitem.parquet.l_discount AS Decimal128(23, 2)) AS Decimal128(38,
4))Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2))CAST(lineitem.parquet.l_discount AS Decimal128(23,
2))lineitem.parquet.l_discountDecimal128(Some(100),23,2)CAST(lineitem.parquet.l_extendedprice
AS Decimal128(38, 4))lineitem.parquet.l_extendedprice,
lineitem.parquet.l_quantity, lineitem.parquet.l_extendedprice,
lineitem.parquet.l_discount, lineitem.parquet.l_tax,
lineitem.parquet.l_returnflag, lineitem.parquet.l_linestatus
|
| | Filter: lineitem.parquet.l_shipdate <=
Date32("10471")
|
| | TableScan: lineitem.parquet
projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag,
l_linestatus, l_shipdate], partial_filters=[lineitem.parquet.l_shipdate <=
Date32("10471")]
|
| physical_plan | SortPreservingMergeExec: [l_returnflag@0 ASC NULLS
LAST,l_linestatus@1 ASC NULLS LAST]
|
| | SortExec: expr=[l_returnflag@0 ASC NULLS
LAST,l_linestatus@1 ASC NULLS LAST]
|
| | ProjectionExec: expr=[l_returnflag@0 as l_returnflag,
l_linestatus@1 as l_linestatus, SUM(lineitem.parquet.l_quantity)@2 as sum_qty,
SUM(lineitem.parquet.l_extendedprice)@3 as sum_base_price,
SUM(lineitem.parquet.l_extendedprice * Int64(1) -
lineitem.parquet.l_discount)@4 as sum_disc_price,
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount *
Int64(1) + lineitem.parquet.l_tax)@5 as sum_charge,
AVG(lineitem.parquet.l_quantity)@6 as avg_qty,
AVG(lineitem.parquet.l_extendedprice)@7 as avg_price,
AVG(lineitem.parquet.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as
count_order]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus],
aggr=[SUM(lineitem.parquet.l_quantity), SUM(lineitem.parquet.l_extendedprice),
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount),
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount *
Int64(1) + lineitem.parquet.l_tax), AVG(lineitem.parquet.l_quantity),
AVG(lineitem.parquet.l_extendedprice), AVG(lineitem.parquet.l_discount),
COUNT(UInt8(1))]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([Column {
name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }],
8), input_partitions=8
|
| | AggregateExec: mode=Partial,
gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus],
aggr=[SUM(lineitem.parquet.l_quantity), SUM(lineitem.parquet.l_extendedprice),
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount),
SUM(lineitem.parquet.l_extendedprice * Int64(1) - lineitem.parquet.l_discount *
Int64(1) + lineitem.parquet.l_tax), AVG(lineitem.parquet.l_quantity),
AVG(lineitem.parquet.l_extendedprice), AVG(lineitem.parquet.l_discount),
COUNT(UInt8(1))]
|
| | ProjectionExec: expr=[CAST(l_extendedprice@1
AS Decimal128(38, 4)) * CAST(Some(100),23,2 - CAST(l_discount@2 AS
Decimal128(23, 2)) AS Decimal128(38, 4)) as
CAST(lineitem.parquet.l_extendedprice AS Decimal128(38, 4)) *
CAST(Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2)) AS Decimal128(38, 4))CAST(Decimal128(Some(100),23,2) -
CAST(lineitem.parquet.l_discount AS Decimal128(23, 2)) AS Decimal128(38,
4))Decimal128(Some(100),23,2) - CAST(lineitem.parquet.l_discount AS
Decimal128(23, 2))CAST(lineitem.parquet.l_discount AS Decimal128(23,
2))lineitem.parquet.l_discountDecimal128(Some(100),23,2)CAST(lineitem.parquet.l_extendedprice
AS Decimal128(38, 4))lineitem.parquet.l_extendedprice, l_quantity@0 as
l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount,
l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as
l_linestatus]
|
| | CoalesceBatchesExec:
target_batch_size=8192
|
| | FilterExec: l_shipdate@6 <= 10471
|
| | ParquetExec: limit=None, partitions={8
groups: [[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:0..20797508],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:20797508..41595016],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:41595016..62392524],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:62392524..83190032],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:83190032..103987540],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:103987540..124785048],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:124785048..145582556],
[home/alamb/tpch_data/parquet_data_SF1/lineitem.parquet:145582556..166380062]]},
predicate=l_shipdate@10 <= 10471, pruning_predicate=l_shipdate_min@0 <= 10471,
projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag,
l_linestatus, l_shipdate]
|
| |
|
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Here is the duckdb generated schema:
```
❯ create external table lineitem stored as parquet location
'lineitem.parquet';
0 rows in set. Query took 0.003 seconds.
❯ describe lineitem;
+-----------------+-------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey | Int32 | YES |
| l_partkey | Int32 | YES |
| l_suppkey | Int32 | YES |
| l_linenumber | Int32 | YES |
| l_quantity | Int32 | YES |
| l_extendedprice | Decimal128(15, 2) | YES |
| l_discount | Decimal128(15, 2) | YES |
| l_tax | Decimal128(15, 2) | YES |
| l_returnflag | Utf8 | YES |
| l_linestatus | Utf8 | YES |
| l_shipdate | Date32 | YES |
| l_commitdate | Date32 | YES |
| l_receiptdate | Date32 | YES |
| l_shipinstruct | Utf8 | YES |
| l_shipmode | Utf8 | YES |
| l_comment | Utf8 | YES |
+-----------------+-------------------+-------------+
16 rows in set. Query took 0.001 seconds.
```
Here is the datafusion generated schema:
```
❯ describe lineitem;
+-----------------+-------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey | Int64 | NO |
| l_partkey | Int64 | NO |
| l_suppkey | Int64 | NO |
| l_linenumber | Int32 | NO |
| l_quantity | Decimal128(15, 2) | NO |
| l_extendedprice | Decimal128(15, 2) | NO |
| l_discount | Decimal128(15, 2) | NO |
| l_tax | Decimal128(15, 2) | NO |
| l_returnflag | Utf8 | NO |
| l_linestatus | Utf8 | NO |
| l_shipdate | Date32 | NO |
| l_commitdate | Date32 | NO |
| l_receiptdate | Date32 | NO |
| l_shipinstruct | Utf8 | NO |
| l_shipmode | Utf8 | NO |
| l_comment | Utf8 | NO |
+-----------------+-------------------+-------------+
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
