Add FilteredPageReader to filter rows based on page statistics

Fatemah Panahi Mon, 31 Oct 2022 09:50:34 -0700

-- Sending as an email in case Jira messages are filtered out. Please let
me know your thoughts on this. Thanks!


Jira ticket: https://issues.apache.org/jira/browse/PARQUET-2210

Currently, we do not use the statistics stored in the page headers to
prune the rows that we read. Row group pruning is very coarse-grained and
in many cases does not eliminate the row group. I propose adding a
FilteredPageReader that accepts a filter and, based on page statistics,
does not return the pages that cannot match the filter.

The initial set of filters could be: EQUALS, IS NULL, IS NOT NULL.
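
To make the pruning rule concrete, here is a rough sketch in Python of how
the three filters could be checked against page statistics. It is not the
proposed implementation: PageStats, page_may_match, and the field names are
hypothetical stand-ins for whatever the page header exposes. The one firm
rule is that a page is dropped only when its statistics prove no row can
match; missing statistics always keep the page.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics in a data page header."""
    min_value: Optional[Any]   # None if not recorded (or page is all nulls)
    max_value: Optional[Any]
    null_count: Optional[int]  # None if not recorded
    num_values: int            # total values in the page, nulls included


def page_may_match(stats: Optional[PageStats], op: str, value: Any = None) -> bool:
    """Return False only when the statistics prove that no row can match."""
    if stats is None:
        return True  # no statistics: the page must be read
    if op == "IS NULL":
        return stats.null_count is None or stats.null_count > 0
    if op == "IS NOT NULL":
        return stats.null_count is None or stats.null_count < stats.num_values
    if op == "EQUALS":
        if stats.null_count is not None and stats.null_count == stats.num_values:
            return False  # every value is null, so nothing equals `value`
        if stats.min_value is None or stats.max_value is None:
            return True   # min/max missing: cannot prune
        return stats.min_value <= value <= stats.max_value
    raise ValueError(f"unsupported filter: {op}")


pages = [
    PageStats(1, 10, 0, 100),
    PageStats(11, 20, 5, 100),
    PageStats(None, None, 50, 50),  # all-null page
]
kept_for_equals_15 = [page_may_match(p, "EQUALS", 15) for p in pages]
```

Note that EQUALS can only say "may match": a page whose min/max range
contains the value still has to be decoded and filtered row by row.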

The FilteredPageReader will also keep track of which row ranges matched
and which did not. We could use this to skip reading the non-matching rows
in the remaining columns. Note that a SkipRecords API is being added to
the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
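
As an illustration of the row-range bookkeeping, the sketch below (Python,
with hypothetical helper names) turns per-page keep/drop decisions on the
filtered column into matched row ranges, then into a skip/read plan that
other column readers could follow. It assumes the number of rows in each
page is known and that pages start on row boundaries. The "skip" steps are
where a call such as SkipRecords might be used; how that API is actually
invoked is not specified here.

```python
from typing import List, Tuple


def matched_row_ranges(page_row_counts: List[int],
                       page_kept: List[bool]) -> List[Tuple[int, int]]:
    """Convert per-page decisions into half-open [start, end) row ranges."""
    ranges: List[Tuple[int, int]] = []
    start = 0
    for rows, kept in zip(page_row_counts, page_kept):
        end = start + rows
        if kept and rows > 0:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous matched range: merge them.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
        start = end
    return ranges


def skip_read_plan(ranges: List[Tuple[int, int]],
                   total_rows: int) -> List[Tuple[str, int]]:
    """Turn matched ranges into ("skip", n) / ("read", n) steps."""
    plan: List[Tuple[str, int]] = []
    pos = 0
    for start, end in ranges:
        if start > pos:
            plan.append(("skip", start - pos))
        plan.append(("read", end - start))
        pos = end
    if pos < total_rows:
        plan.append(("skip", total_rows - pos))
    return plan


# Four pages of the filtered column; only the middle two may match.
counts = [100, 100, 50, 50]
kept = [False, True, True, False]
ranges = matched_row_ranges(counts, kept)
plan = skip_read_plan(ranges, sum(counts))
```

Expressing the result as row ranges rather than page indexes matters
because pages in different columns generally do not cover the same rows.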
