[GitHub] [arrow-datafusion] alamb opened a new issue #847: Implement parquet page-level skipping with column index, using min/max stats

GitBox Tue, 10 Aug 2021 04:02:47 -0700


alamb opened a new issue #847:
URL: https://github.com/apache/arrow-datafusion/issues/847



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   (this is summarized version of some comments in a [discussion 
thread](https://the-asf.slack.com/archives/C01QUFS30TD/p1628147053020500) 
between @sunchao @jorgecarleitao and @nju_yaho (not sure if that is the correct 
github handle)
   
   While reading data from parquet files, the more data that can be immediately 
ruled out without decompressing, the faster the query will go
   
   @sunchao pointed out that the structure of Parquet also allows page-level 
skipping with column index, using min/max stats, which is pretty effective when 
data is sorted. The data being sorted is important because otherwise a data 
page could contain random data within a big range [min, max] and predicates 
such as `col < 42` won’t be very effective.
   
   There is a good blog post about this feature: 
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
   
   Note that page level min/max statistics is a relatively new feature. We only 
know of parquet-mr and impala which have implemented it. Spark also recently 
added the support in https://github.com/apache/spark/pull/32753. The page 
indexes are stored in the column chunk metadata: 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L798
   
   **Describe the solution you'd like**
   
   To achieve this, latest Parquet version has introduced two kinds of indexes: 
`ColumnIndex` and `OffsetIndex`.
   
   They are all stored in the file footer for each `ColumnChunk` of each 
`RowGroup`.
   
   For `ColumnIndex`, it includes `min`, `max` for each page; while for 
`OffsetIndex`, it includes row ranges, file offset range for each page.
   
   For example, to filter by Column A to achieve filtering on column B
   
   ```
        1. For Column A:
                a. According to the ColumnIndex, filter qualified pages
                b. According to the OffsetIndex, achieve the row ranges for the 
qualified pages
        2. For Column B:
                a. According to the row ranges from Column A and its 
OffsetIndex, find out qualified pages whose row ranges are overlapped
           b. According to the filtered OffsetIndex, read related pages
   ```
   
   In the case Column B above you also need to use row ranges when scanning a 
page, and skip those rows if they are not within the range. In the case of 
multiple predicates on different columns, you’d also need to calculate row 
range intersects or union.
   
   **Describe alternatives you've considered**
   TBD
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new issue #847: Implement parquet page-level skipping with column index, using min/max stats

Reply via email to