Ted-Jiang commented on PR #3780:
URL: 
https://github.com/apache/arrow-datafusion/pull/3780#issuecomment-1274556186

   I run the tpch-q1  modify the filter  ">=" to  "<=" for skipping more pages
   ```
   yangjiang@LM-SHC-15009782 datafusion-cli % cargo run --release --bin 
datafusion-cli
       Finished release [optimized] target(s) in 1.14s
        Running `target/release/datafusion-cli`
   DataFusion CLI v13.0.0
   ❯ create external table lineitem stored as parquet location 
'/Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem_order';
   0 rows in set. Query took 0.004 seconds.
   ❯ select
       l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
   from
       lineitem
   where
           l_shipdate <= date '1996-09-02'
   group by
       l_returnflag,
       l_linestatus
   order by
       l_returnflag,
       l_linestatus;
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
   | l_returnflag | l_linestatus | sum_qty  | sum_base_price     | 
sum_disc_price     | sum_charge         | avg_qty            | avg_price        
  | avg_disc            | count_order |
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
   | A            | F            | 37734107 | 56586554400.729996 | 
53758257134.87001  | 55909065222.8277   | 25.522005853257337 | 
38273.12973462167  | 0.04998529583839799 | 1478493     |
   | N            | F            | 991417   | 1487504710.379999  | 
1413082168.0540993 | 1469649223.194375  | 25.516471920522985 | 
38284.467760848274 | 0.05009342667421619 | 38854       |
   | N            | O            | 28196030 | 42296870106.47     | 
40182807989.9758   | 41790214897.953186 | 25.51279800791189  | 
38271.75327140934  | 0.050005284245347   | 1105172     |
   | R            | F            | 37719753 | 56568041380.90001  | 
53741292684.604    | 55889619119.831924 | 25.50579361269077  | 
38250.854626099666 | 0.05000940583012741 | 1478870     |
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+-------------+
   4 rows in set. Query took 0.245 seconds.
   ```
   then set `with_page_index`
   ```
   select
       l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
   from
       lineitem
   where
           l_shipdate <= date '1996-09-02'
   group by
       l_returnflag,
       l_linestatus
   order by
       l_returnflag,
       l_linestatus;
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
   | l_returnflag | l_linestatus | sum_qty  | sum_base_price     | 
sum_disc_price     | sum_charge         | avg_qty            | avg_price        
  | avg_disc             | count_order |
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
   | A            | F            | 37734107 | 56586554400.73     | 
53758257134.869995 | 55909065222.82769  | 25.522005853257337 | 
38273.129734621674 | 0.04998529583839799  | 1478493     |
   | N            | F            | 991417   | 1487504710.3799992 | 
1413082168.0540993 | 1469649223.1943748 | 25.516471920522985 | 
38284.46776084828  | 0.05009342667421619  | 38854       |
   | N            | O            | 28196030 | 42296870106.47002  | 
40182807989.97579  | 41790214897.95318  | 25.51279800791189  | 
38271.753271409354 | 0.050005284245347004 | 1105172     |
   | R            | F            | 37719753 | 56568041380.9      | 
53741292684.604004 | 55889619119.831924 | 25.50579361269077  | 
38250.85462609966  | 0.05000940583012743  | 1478870     |
   
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
   4 rows in set. Query took 0.174 seconds.
   ```
   
   And with log=debug got
   ```
   [2022-10-11T11:35:38Z DEBUG datafusion::physical_plan::file_format::parquet] 
   Use filter and page index create RowSelection
    [[RowSelector { row_count: 3820000, skip: false }], 
   [RowSelector { row_count: 300000, skip: false }, 
   RowSelector { row_count: 1900000, skip: true }]]
   
   ```
   
   Seems get ~1.4x Performance gain.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to