[
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
fatemah updated PARQUET-2210:
-----------------------------
Summary: Skip pages based on header metadata using a callback (was: Add
FilteredPageReader to filter rows based on page statistics)
> Skip pages based on header metadata using a callback
> ----------------------------------------------------
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
> Issue Type: New Feature
> Reporter: fatemah
> Priority: Major
>
> Currently, we do not use the statistics that are stored in the page headers
> to prune the rows that we read. Row group pruning is very coarse-grained
> and in many cases does not prune the row group. I propose adding a
> FilteredPageReader that accepts a filter and does not return the pages
> that cannot match the filter according to their page statistics.
> The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
> Also, the FilteredPageReader will keep track of which row ranges matched and
> which did not. We could use this to skip reading the non-matching rows from
> the rest of the columns. Note that the SkipRecords API was recently added to
> the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
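
For illustration only, the idea described above could be sketched roughly as follows. This is not the Parquet reader API: PageStats, equals, is_null, is_not_null and filter_pages are hypothetical names invented for this sketch. It assumes a flat (non-repeated) column where every page header carries a row count, a null count and optional min/max values. A filter answers "might this page contain a matching row?", so a page is dropped only when its statistics rule out every row in it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics found in a page header."""
    num_rows: int
    null_count: int
    min_value: Optional[int]  # None when no min/max was recorded
    max_value: Optional[int]


# A filter returns True when the page MAY contain a matching row.
PageFilter = Callable[[PageStats], bool]
RowRange = Tuple[int, int]  # half-open range [start, end)


def equals(value: int) -> PageFilter:
    def might_match(s: PageStats) -> bool:
        if s.null_count == s.num_rows:
            return False  # an all-null page cannot equal a value
        if s.min_value is None or s.max_value is None:
            return True  # no min/max available: cannot prune safely
        return s.min_value <= value <= s.max_value
    return might_match


def is_null() -> PageFilter:
    return lambda s: s.null_count > 0


def is_not_null() -> PageFilter:
    return lambda s: s.null_count < s.num_rows


def filter_pages(
    pages: List[PageStats], page_filter: PageFilter
) -> Tuple[List[int], List[RowRange], List[RowRange]]:
    """Return (kept page indexes, matched row ranges, skipped row ranges)."""
    kept: List[int] = []
    matched: List[RowRange] = []
    skipped: List[RowRange] = []
    first_row = 0
    for index, stats in enumerate(pages):
        row_range = (first_row, first_row + stats.num_rows)
        if page_filter(stats):
            kept.append(index)
            matched.append(row_range)
        else:
            skipped.append(row_range)
        first_row += stats.num_rows
    return kept, matched, skipped


# Three made-up pages: plain values, all nulls, and a mix.
pages = [
    PageStats(num_rows=100, null_count=0, min_value=1, max_value=50),
    PageStats(num_rows=100, null_count=100, min_value=None, max_value=None),
    PageStats(num_rows=50, null_count=5, min_value=40, max_value=90),
]
kept, matched, skipped = filter_pages(pages, equals(45))
```

In this sketch the skipped row ranges are what a reader of the other columns would pass over, for example through something like the SkipRecords API mentioned above. A kept page is only a candidate: the rows in it still need to be checked against the actual predicate.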
--
This message was sent by Atlassian Jira
(v8.20.10#820010)