Ted-Jiang commented on PR #3780:
URL:
https://github.com/apache/arrow-datafusion/pull/3780#issuecomment-1274556186
I run the tpch-q1 modify the filter ">=" to "<=" for skipping more pages
```
yangjiang@LM-SHC-15009782 datafusion-cli % cargo run --release --bin
datafusion-cli
Finished release [optimized] target(s) in 1.14s
Running `target/release/datafusion-cli`
DataFusion CLI v13.0.0
❯ create external table lineitem stored as parquet location
'/Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem_order';
0 rows in set. Query took 0.004 seconds.
❯ select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1996-09-02'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price |
sum_disc_price | sum_charge | avg_qty | avg_price
| avg_disc | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
| A | F | 37734107 | 56586554400.729996 |
53758257134.87001 | 55909065222.8277 | 25.522005853257337 |
38273.12973462167 | 0.04998529583839799 | 1478493 |
| N | F | 991417 | 1487504710.379999 |
1413082168.0540993 | 1469649223.194375 | 25.516471920522985 |
38284.467760848274 | 0.05009342667421619 | 38854 |
| N | O | 28196030 | 42296870106.47 |
40182807989.9758 | 41790214897.953186 | 25.51279800791189 |
38271.75327140934 | 0.050005284245347 | 1105172 |
| R | F | 37719753 | 56568041380.90001 |
53741292684.604 | 55889619119.831924 | 25.50579361269077 |
38250.854626099666 | 0.05000940583012741 | 1478870 |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
4 rows in set. Query took 0.245 seconds.
```
then set `with_page_index`
```
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1996-09-02'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price |
sum_disc_price | sum_charge | avg_qty | avg_price
| avg_disc | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| A | F | 37734107 | 56586554400.73 |
53758257134.869995 | 55909065222.82769 | 25.522005853257337 |
38273.129734621674 | 0.04998529583839799 | 1478493 |
| N | F | 991417 | 1487504710.3799992 |
1413082168.0540993 | 1469649223.1943748 | 25.516471920522985 |
38284.46776084828 | 0.05009342667421619 | 38854 |
| N | O | 28196030 | 42296870106.47002 |
40182807989.97579 | 41790214897.95318 | 25.51279800791189 |
38271.753271409354 | 0.050005284245347004 | 1105172 |
| R | F | 37719753 | 56568041380.9 |
53741292684.604004 | 55889619119.831924 | 25.50579361269077 |
38250.85462609966 | 0.05000940583012743 | 1478870 |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
4 rows in set. Query took 0.174 seconds.
```
And with log=debug got
```
[2022-10-11T11:35:38Z DEBUG datafusion::physical_plan::file_format::parquet]
Use filter and page index create RowSelection
[[RowSelector { row_count: 3820000, skip: false }],
[RowSelector { row_count: 300000, skip: false },
RowSelector { row_count: 1900000, skip: true }]]
```
Seems get ~1.4x Performance gain.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]