[
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
fatemah updated PARQUET-2210:
-----------------------------
Description: Currently, the page header metadata is not exposed, so it cannot
be used for skipping pages. I propose exposing the metadata through a callback
that lets the caller decide whether to read or skip each page based on that
metadata. The callback would have the following signature:
std::function<bool(const format::PageHeader&)> skip_page_callback
(was: Currently, we do not use the statistics that are stored in the page
headers for pruning the rows that we read. Row group pruning is very
coarse-grained and in many cases does not prune the row group. I propose adding
a FilteredPageReader that would accept a filter and would not return the pages
that do not match the filter based on page statistics.
The initial set of filters could be: EQUALS, IS NULL, IS NOT NULL.
Also, the FilteredPageReader would keep track of which row ranges matched and
which did not. We could use this to skip reading non-matching rows from the
rest of the columns. Note that the SkipRecords API was recently added to the
Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188))
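
For illustration only, the superseded FilteredPageReader idea above could be
sketched roughly as follows. This is not code from the Parquet C++ reader:
PageStats, FilterOp, PageMayMatch and MatchingRowRanges are hypothetical names,
the column is assumed to hold flat (non-repeated) int64 values so that one
value corresponds to one row, and min/max are assumed to describe only the
non-null values of a page.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical filter kinds, mirroring EQUALS, IS NULL, IS NOT NULL.
enum class FilterOp { kEquals, kIsNull, kIsNotNull };

// Hypothetical, simplified per-page statistics (assumption: int64 column,
// one value per row, min/max cover non-null values only).
struct PageStats {
  int64_t min_value;
  int64_t max_value;
  int64_t null_count;
  int64_t num_rows;
};

// Returns false only when the statistics prove that no row can match.
bool PageMayMatch(const PageStats& s, FilterOp op, int64_t value = 0) {
  switch (op) {
    case FilterOp::kEquals:
      // Needs at least one non-null value and `value` inside [min, max].
      return s.null_count < s.num_rows && s.min_value <= value &&
             value <= s.max_value;
    case FilterOp::kIsNull:
      return s.null_count > 0;
    case FilterOp::kIsNotNull:
      return s.null_count < s.num_rows;
  }
  return true;  // Unknown filter: be conservative and keep the page.
}

// Returns half-open [first_row, last_row) ranges covered by pages that may
// match; adjacent matching pages are merged into a single range.
std::vector<std::pair<int64_t, int64_t>> MatchingRowRanges(
    const std::vector<PageStats>& pages, FilterOp op, int64_t value = 0) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  int64_t first_row = 0;
  for (const PageStats& s : pages) {
    const int64_t end_row = first_row + s.num_rows;
    if (PageMayMatch(s, op, value)) {
      if (!ranges.empty() && ranges.back().second == first_row) {
        ranges.back().second = end_row;  // Extend the previous range.
      } else {
        ranges.emplace_back(first_row, end_row);
      }
    }
    first_row = end_row;
  }
  return ranges;
}
```

The returned ranges are the kind of information that could then be used to
skip non-matching rows in the other columns.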
> Skip pages based on header metadata using a callback
> ----------------------------------------------------
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
> Issue Type: New Feature
> Reporter: fatemah
> Priority: Major
>
> Currently, the page header metadata is not exposed, so it cannot be used
> for skipping pages. I propose exposing the metadata through a callback that
> lets the caller decide whether to read or skip each page based on that
> metadata. The callback would have the following signature:
> std::function<bool(const format::PageHeader&)> skip_page_callback
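
A minimal sketch of how such a callback might be used, under stated
assumptions: the PageHeader struct below is a simplified stand-in for
format::PageHeader (the real Thrift-generated struct has more fields),
PagesToRead is a hypothetical helper rather than an existing reader API, and a
return value of true is assumed to mean "skip this page", which the proposal
does not spell out.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Simplified stand-in for format::PageHeader (assumption, not the real type).
struct PageHeader {
  int32_t compressed_page_size;
  int32_t num_values;
};

// Callback shape from the proposal; true is assumed to mean "skip the page".
using SkipPageCallback = std::function<bool(const PageHeader&)>;

// Hypothetical helper: returns the indices of the pages that would be read.
std::vector<std::size_t> PagesToRead(const std::vector<PageHeader>& headers,
                                     const SkipPageCallback& skip_page_callback) {
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < headers.size(); ++i) {
    // An empty callback means no page is skipped.
    if (skip_page_callback && skip_page_callback(headers[i])) continue;
    kept.push_back(i);
  }
  return kept;
}
```

For example, passing
[](const PageHeader& h) { return h.num_values == 0; }
as the callback would skip every page whose header reports no values, without
the reader having to decompress or decode that page.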