[ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fatemah updated PARQUET-2210:
-----------------------------
    Description: Currently, we do not expose the page header metadata, so it 
cannot be used for skipping pages. I propose exposing the metadata through a 
callback that would allow the caller to decide whether to read or skip the 
page based on the metadata. The signature of the callback would be the 
following: std::function<bool(const format::PageHeader&)> skip_page_callback  
(was: Currently, we do not use the statistics that are stored in the page 
headers for pruning the rows that we read. Row group pruning is very 
coarse-grained and in many cases does not prune the row group. I propose adding 
a FilteredPageReader that would accept a filter and would not return the pages 
that do not match the filter based on page statistics.

The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.

Also, the FilteredPageReader will keep track of which row ranges matched and 
which did not. We could use this to skip reading rows that do not match from the rest 
of the columns. Note that the SkipRecords API was recently added to the Parquet 
reader (https://issues.apache.org/jira/browse/PARQUET-2188))

> Skip pages based on header metadata using a callback
> ----------------------------------------------------
>
>                 Key: PARQUET-2210
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2210
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: fatemah
>            Priority: Major
>
> Currently, we do not expose the page header metadata, so it cannot be used 
> for skipping pages. I propose exposing the metadata through a callback that 
> would allow the caller to decide whether to read or skip the page based 
> on the metadata. The signature of the callback would be the following: 
> std::function<bool(const format::PageHeader&)> skip_page_callback
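
To illustrate how a callback with the proposed signature might be used, here is a minimal, self-contained sketch. It is not the Parquet C++ implementation: the PageHeader struct below is a hypothetical stand-in for format::PageHeader with only a few fields, SelectPages is a hypothetical helper that mimics a reader loop, and the convention that the callback returns true to skip a page is an assumption inferred from the name skip_page_callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for format::PageHeader (assumption: the real
// Thrift-generated struct carries more fields than shown here).
enum class PageType { DATA_PAGE, DICTIONARY_PAGE };

struct PageHeader {
  PageType type;
  int32_t num_values;
  int32_t compressed_page_size;
};

// Same shape as the proposed signature. Assumed convention: return true
// to skip the page, false to read it.
using SkipPageCallback = std::function<bool(const PageHeader&)>;

// Hypothetical reader loop: returns the indices of the pages that would
// be read. An empty callback means no page is skipped.
std::vector<std::size_t> SelectPages(const std::vector<PageHeader>& headers,
                                     const SkipPageCallback& skip_page_callback) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < headers.size(); ++i) {
    if (skip_page_callback && skip_page_callback(headers[i])) {
      continue;  // The caller asked to skip this page based on its header.
    }
    selected.push_back(i);
  }
  return selected;
}

// Example predicate: skip data pages that carry no values.
bool SkipEmptyDataPages(const PageHeader& header) {
  return header.type == PageType::DATA_PAGE && header.num_values == 0;
}
```

A caller would pass SkipEmptyDataPages (or any lambda with the same shape) as the callback. Because the decision is made from the header alone, a skipped page never needs to be decompressed or decoded.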



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
