Hi Fatemah,

I think there are two things to consider here:

1. How will expressions be modeled? There are already some examples of using
expressions in Arrow for pruning predicates [1]. Do you plan to re-use them?

2. Along the same lines: is the proposed approach taken because an API that
exposes the raw statistics and lets callers filter externally would be too
unwieldy?
Thanks,
Micah

[1] https://github.com/apache/arrow/blob/5e49174d69deb9d1cbbdf82bc8041b90098f560b/cpp/src/arrow/dataset/file_parquet.cc

On Mon, Oct 31, 2022 at 9:50 AM Fatemah Panahi <[email protected]> wrote:

> -- Sending as an email in case Jira messages are filtered out. Please let
> me know your thoughts on this. Thanks!
>
> Jira ticket: https://issues.apache.org/jira/browse/PARQUET-2210
>
> Currently, we do not use the statistics that are stored in the page headers
> to prune the rows that we read. Row-group pruning is very coarse-grained
> and in many cases does not prune the row group at all. I propose adding a
> FilteredPageReader that would accept a filter and, based on page
> statistics, would not return the pages that cannot match the filter.
>
> The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
>
> The FilteredPageReader will also keep track of which row ranges matched and
> which did not. We could use this to skip reading the non-matching rows from
> the rest of the columns. Note that a SkipRecords API is being added to the
> Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
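To make the pruning decision in the quoted proposal concrete, here is a minimal
C++ sketch of how a page could be skipped from page-header statistics for the
three proposed filters (EQUALS, IS NULL, IS NOT NULL). All names here
(PageStats, PageFilter, CanSkipPage) are illustrative assumptions, not the
actual parquet-cpp API:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the min/max/null-count statistics that
// Parquet can store in a data page header. Typed int64 for brevity.
struct PageStats {
  std::optional<int64_t> min;  // absent if the writer omitted stats
  std::optional<int64_t> max;
  int64_t null_count = 0;
  int64_t num_values = 0;
};

enum class FilterOp { kEquals, kIsNull, kIsNotNull };

struct PageFilter {
  FilterOp op;
  int64_t value = 0;  // only meaningful for kEquals
};

// Returns true only when the statistics *prove* that no row in the
// page can satisfy the filter; otherwise the page must be read.
bool CanSkipPage(const PageStats& stats, const PageFilter& filter) {
  switch (filter.op) {
    case FilterOp::kEquals:
      // Without min/max we must conservatively keep the page.
      if (!stats.min || !stats.max) return false;
      return filter.value < *stats.min || filter.value > *stats.max;
    case FilterOp::kIsNull:
      // Skip when the page contains no nulls at all.
      return stats.null_count == 0;
    case FilterOp::kIsNotNull:
      // Skip when every value in the page is null.
      return stats.null_count == stats.num_values;
  }
  return false;
}
```

Note the asymmetry: statistics can only ever prove a page *irrelevant*; a page
that survives pruning may still contain no matching rows, which is where the
matched/unmatched row-range tracking and SkipRecords come in.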
