Ted-Jiang commented on PR #5099: URL: https://github.com/apache/arrow-datafusion/pull/5099#issuecomment-1539555148
I test in my local M1, seems there is no regression in q16 (which page index
prune none data). Set `DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true
target/release/datafusion-cli`
But i found
```
without page Index
ParquetExec: file_groups={10 groups:
[[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]},
projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000,
elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0,
bytes_scanned=2657280, row_groups_pruned=0, page_index_rows_filtered=0,
predicate_ev
aluation_errors=0, num_predicate_creation_errors=0, pushdown_rows_filtered=0,
time_elapsed_scanning_until_data=25.893623ms, page_index_eval_time=20ns,
time_elapsed_opening=117.212334ms, pushdown_eval_time=20ns,
time_elapsed_scanning_total=250.615706ms, time_elapsed_processing=128.193753ms]
With page index
ParquetExec: file_groups={10 groups:
[[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]},
projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000,
elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0,
row_groups_pruned=0, bytes_scanned=2903390, num_predicate_creation_errors=0,
predica
te_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0,
time_elapsed_processing=133.898828ms, pushdown_eval_time=20ns,
time_elapsed_scanning_until_data=17.617834ms, page_index_eval_time=20ns,
time_elapsed_opening=88.243249ms, time_elapsed_scanning_total=238.237217ms]
|
```
Form the `bytes_scanned` you can see second one scan more bytes (reasonable
with page Index).
So IMOP , you do the test in `google cloud machine` which means will have
more latency when fetch bytes than local machine.
From the plan
```
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=18314,
elapsed_compute=25.292µs, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | ProjectionExec: expr=[group_alias_0@0 as
part.p_brand, group_alias_1@1 as part.p_type, group_alias_2@2 as part.p_size,
COUNT(alias1)@3 as supplier_cnt], metrics=[output_rows=18314,
elapsed_compute=29.417µs, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1,
group_alias_2@2 as group_alias_2], aggr=[COUNT(alias1)],
metrics=[output_rows=18314, elapsed_compute=170.210247ms, spill_count=0,
spilled_bytes=0, mem_used=0]
|
| | CoalesceBatchesExec: target_batch_size=8192,
metrics=[output_rows=85318, elapsed_compute=9.629845ms, spill_count=0,
spilled_bytes=0, mem_used=0]
|
| | RepartitionExec: partitioning=Hash([Column {
name: "group_alias_0", index: 0 }, Column { name: "group_alias_1", index: 1 },
Column { name: "group_alias_2", index: 2 }], 10), input_partitions=10,
metrics=[fetch_time=4.378393626s, repart_time=56.449415ms, send_time=729.375µs]
|
| | AggregateExec: mode=Partial,
gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1,
group_alias_2@2 as group_alias_2], aggr=[COUNT(alias1)],
metrics=[output_rows=85318, elapsed_compute=398.147584ms, spill_count=0,
spilled_bytes=0, mem_used=0]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1,
group_alias_2@2 as group_alias_2, alias1@3 as alias1], aggr=[],
metrics=[output_rows=118250, elapsed_compute=370.184626ms, spill_count=0,
spilled_bytes=0, mem_used=0]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=118250,
elapsed_compute=14.044362ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | RepartitionExec:
partitioning=Hash([Column { name: "group_alias_0", index: 0 }, Column { name:
"group_alias_1", index: 1 }, Column { name: "group_alias_2", index: 2 }, Column
{ name: "alias1", index: 3 }], 10), input_partitions=10,
metrics=[fetch_time=3.459734628s, repart_time=202.497167ms,
send_time=106.304383ms]
|
| | AggregateExec: mode=Partial,
gby=[p_brand@1 as group_alias_0, p_type@2 as group_alias_1, p_size@3 as
group_alias_2, ps_suppkey@0 as alias1], aggr=[], metrics=[output_rows=118250,
elapsed_compute=387.699496ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=118274, elapsed_compute=16.411µs,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | HashJoinExec: mode=Partitioned,
join_type=LeftAnti, on=[(Column { name: "ps_suppkey", index: 0 }, Column {
name: "s_suppkey", index: 0 })], metrics=[output_rows=118278,
build_input_rows=118324, output_batches=13, input_rows=118278,
input_batches=13, build_input_batches=20, build_mem_used=11183594,
join_time=36.18104ms, build_time=71.978712ms]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=118324,
elapsed_compute=16.599522ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | RepartitionExec:
partitioning=Hash([Column { name: "ps_suppkey", index: 0 }], 10),
input_partitions=10, metrics=[fetch_time=2.856884712s, repart_time=44.825252ms,
send_time=647.955µs]
|
| | ProjectionExec:
expr=[ps_suppkey@1 as ps_suppkey, p_brand@3 as p_brand, p_type@4 as p_type,
p_size@5 as p_size], metrics=[output_rows=118324, elapsed_compute=39.667µs,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=118324, elapsed_compute=47.91µs,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | HashJoinExec:
mode=Partitioned, join_type=Inner, on=[(Column { name: "ps_partkey", index: 0
}, Column { name: "p_partkey", index: 0 })], metrics=[output_rows=29581,
build_input_rows=800000, output_batches=10, input_rows=29581, input_batches=10,
build_input_batches=100, build_mem_used=56076192, join_time=84.478875ms,
build_time=281.910325ms]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=800000,
elapsed_compute=7.032531ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | RepartitionExec:
partitioning=Hash([Column { name: "ps_partkey", index: 0 }], 10),
input_partitions=10, metrics=[fetch_time=213.551508ms,
repart_time=111.779256ms, send_time=2.88029ms]
|
| | ParquetExec:
file_groups={10 groups:
[[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565],
[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]},
projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000,
elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0,
row_groups_pruned=0, bytes
_scanned=2903390, num_predicate_creation_errors=0,
predicate_evaluation_errors=0, page_index_rows_filtered=0,
pushdown_rows_filtered=0, time_elapsed_processing=133.898828ms,
pushdown_eval_time=20ns, time_elapsed_scanning_until_data=17.617834ms,
page_index_eval_time=20ns, time_elapsed_opening=88.243249ms,
time_elapsed_scanning_total=238.237217ms]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=29581, elapsed_compute=4.735922ms,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | RepartitionExec:
partitioning=Hash([Column { name: "p_partkey", index: 0 }], 10),
input_partitions=10, metrics=[fetch_time=1.742820835s,
repart_time=362.862215ms, send_time=352.200595ms]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(10), input_partitions=1,
metrics=[fetch_time=178.274085ms, repart_time=1ns, send_time=14.125µs]
|
| |
CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=29581,
elapsed_compute=2.310501ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | FilterExec:
p_brand@1 != Brand#45 AND p_type@2 NOT LIKE MEDIUM POLISHED% AND Use p_size@3
IN (SET) ([Literal { value: Int32(49) }, Literal { value: Int32(14) }, Literal
{ value: Int32(23) }, Literal { value: Int32(45) }, Literal { value: Int32(19)
}, Literal { value: Int32(3) }, Literal { value: Int32(36) }, Literal { value:
Int32(9) }]), metrics=[output_rows=29581, elapsed_compute=57.569834ms,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| |
ParquetExec: file_groups={1 group:
[[Users/yangjiang/tpch-parquet/part/part-0.parquet]]}, projection=[p_partkey,
p_brand, p_type, p_size], predicate=p_brand@3 != Brand#45 AND p_type@4 NOT LIKE
MEDIUM POLISHED% AND Use p_size@5 IN (SET) ([Literal { value: Int32(49) },
Literal { value: Int32(14) }, Literal { value: Int32(23) }, Literal { value:
Int32(45) }, Literal { value: Int32(19) }, Literal { value: Int32(3) }, Literal
{ value: Int32(36) }, Literal { value: Int32(9) }]),
pruning_predicate=(p_brand_min@0 != Brand#45 OR Brand#45 != p_brand_max@1) AND
(p_size_min@2 <= 49 AND 49 <= p_size_max@3 OR p_size_min@2 <= 14 AND 14 <=
p_size_max@3 OR p_size_min@2 <= 23 AND 23 <= p_size_max@3 OR p_size_min@2 <= 45
AND 45 <= p_size_max@3 OR p_size_min@2 <= 19 AND 19 <= p_size_max@3 OR
p_size_min@2 <= 3 AND 3 <= p_size_max@3 OR p_size_min@2 <= 36 AND 36 <=
p_size_max@3 OR p_size_min@2 <= 9 AND 9 <= p_size_max@3), metrics=[output_
rows=200000, elapsed_compute=1ns, spill_count=0, spilled_bytes=0, mem_used=0,
row_groups_pruned=0, bytes_scanned=935235, num_predicate_creation_errors=0,
predicate_evaluation_errors=0, page_index_rows_filtered=0,
pushdown_rows_filtered=0, time_elapsed_processing=114.579335ms,
pushdown_eval_time=2ns, time_elapsed_scanning_until_data=20.304792ms,
page_index_eval_time=184.001µs, time_elapsed_opening=3.196208ms,
time_elapsed_scanning_total=179.528835ms] |
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=4, elapsed_compute=16.377µs,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | RepartitionExec:
partitioning=Hash([Column { name: "s_suppkey", index: 0 }], 10),
input_partitions=10, metrics=[fetch_time=377.902418ms, repart_time=29.384µs,
send_time=3.968µs]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(10), input_partitions=1,
metrics=[fetch_time=37.565626ms, repart_time=1ns, send_time=3.667µs]
|
| | ProjectionExec:
expr=[s_suppkey@0 as s_suppkey], metrics=[output_rows=4,
elapsed_compute=8.25µs, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | CoalesceBatchesExec:
target_batch_size=8192, metrics=[output_rows=4, elapsed_compute=43.582µs,
spill_count=0, spilled_bytes=0, mem_used=0]
|
| | FilterExec:
s_comment@1 LIKE %Customer%Complaints%, metrics=[output_rows=4,
elapsed_compute=21.399292ms, spill_count=0, spilled_bytes=0, mem_used=0]
|
| | ParquetExec:
file_groups={1 group:
[[Users/yangjiang/tpch-parquet/supplier/part-0.parquet]]},
projection=[s_suppkey, s_comment], predicate=s_comment@6 LIKE
%Customer%Complaints%, metrics=[output_rows=10000, elapsed_compute=1ns,
spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned=0,
bytes_scanned=221440, num_predicate_creation_errors=0,
predicate_evaluation_errors=0, page_index_rows_filtered=0,
pushdown_rows_filtered=0, time_elapsed_processing=12.006207ms,
pushdown_eval_time=2ns, time_elapsed_scanning_until_data=12.450667ms,
page_index_eval_time=251ns, time_elapsed_opening=2.942125ms,
time_elapsed_scanning_total=34.518458ms]
|
| |
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 0.534 seconds.
```
I found there is one place need improvement, which read ` ParquetExec:
file_groups={10 groups: [[Users/yangjiang/tpch-parquet/partsup` **without
filter pushdown still read the pageIndex bytes** 🤣
So i want to improve this and retest this in cloud env, @alamb Is this
reasonable 🤔 PATL
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
