What do we think about introducing a dependency on arrow::compute::Expression for specifying and evaluating the filter? Is that acceptable?
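For readers following along, here is a minimal self-contained sketch of the kind of predicate the proposal needs to model, restricted to the three filters suggested below (EQUALS, IS NULL, IS NOT NULL). The `Filter`/`Matches` names are hypothetical stand-ins, not the Arrow API; arrow::compute::Expression would play this role (built via field refs and literals) if we take that dependency. `Matches` shows the "strong guarantee" semantics discussed below, where every returned value is checked against the filter:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for a filter expression such as
// arrow::compute::Expression, covering the three proposed predicates.
struct Filter {
  enum Kind { kEquals, kIsNull, kIsNotNull } kind;
  std::string column;
  std::optional<int64_t> value;  // only set for kEquals
};

Filter Equals(std::string column, int64_t value) {
  return Filter{Filter::kEquals, std::move(column), value};
}
Filter IsNull(std::string column) {
  return Filter{Filter::kIsNull, std::move(column), std::nullopt};
}
Filter IsNotNull(std::string column) {
  return Filter{Filter::kIsNotNull, std::move(column), std::nullopt};
}

// Evaluate the filter against one (possibly null) cell. Under the
// "strong guarantee" every value surviving page pruning is re-checked
// with this predicate, so callers never need to re-filter.
bool Matches(const Filter& f, std::optional<int64_t> cell) {
  switch (f.kind) {
    case Filter::kEquals:    return cell.has_value() && *cell == *f.value;
    case Filter::kIsNull:    return !cell.has_value();
    case Filter::kIsNotNull: return cell.has_value();
  }
  return false;
}
```

With arrow::compute::Expression the equivalent predicate would be built from a field reference and a literal rather than a hand-rolled struct; the sketch only illustrates the evaluation semantics.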
On Tue, Nov 1, 2022 at 10:42 AM Fatemah Panahi <[email protected]> wrote:

> Hi Micah,
>
> Answers inline.
>
> Another item we need to think about is whether we want to guarantee that
> all returned rows match the filter. The stronger guarantee means that we
> need to check each and every value, rather than relying only on the
> min/max in the page headers, but it relieves the caller from re-filtering
> the values. I am leaning towards the stronger guarantee.
>
> On Mon, Oct 31, 2022 at 10:58 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Fatemah,
>> I think there are likely two things to consider here:
>> 1. How will expressions be modeled? There are already some examples of
>> using expressions in Arrow for pruning predicates [1]. Do you plan to
>> re-use them?
>
> Thanks for pointing this out. Yes, I will consider re-using
> arrow::compute::Expression for specifying the filter.
>
>> 2. Along these lines, is the proposed approach taken because an API that
>> exposes the raw data needed to filter externally would be too unwieldy?
>
> I think it makes sense for Parquet readers to support the filtering
> functionality, rather than each client implementing its own filtering.
> This would also allow for more optimizations. For example, right now the
> page header metadata is not exposed, and the caller simply calls
> ReadBatch without caring which page it is in. To use the min/max in the
> page headers for filtering, we would need to expose this metadata, and
> the read logic would become more complicated.
>
>> Thanks,
>> Micah
>>
>> [1]
>> https://github.com/apache/arrow/blob/5e49174d69deb9d1cbbdf82bc8041b90098f560b/cpp/src/arrow/dataset/file_parquet.cc
>>
>> On Mon, Oct 31, 2022 at 9:50 AM Fatemah Panahi <[email protected]> wrote:
>>
>> > -- Sending as an email in case Jira messages are filtered out. Please
>> > let me know your thoughts on this. Thanks!
>> >
>> > Jira ticket: https://issues.apache.org/jira/browse/PARQUET-2210
>> >
>> > Currently, we do not use the statistics that are stored in the page
>> > headers to prune the rows that we read. Row-group pruning is very
>> > coarse-grained and in many cases does not prune the row group at all.
>> > I propose adding a FilteredPageReader that would accept a filter and,
>> > based on page statistics, would not return the pages that cannot
>> > match it.
>> >
>> > The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
>> >
>> > The FilteredPageReader will also keep track of which row ranges
>> > matched and which did not. We could use this to skip reading the
>> > non-matching rows from the rest of the columns. Note that a
>> > SkipRecords API is being added to the Parquet reader
>> > (https://issues.apache.org/jira/browse/PARQUET-2188).
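To make the page-pruning idea concrete, here is a hedged sketch of the statistics check a FilteredPageReader might perform. `PageStats` and `PageMayMatch` are hypothetical names for illustration; the real reader would consult the min/max and null count written in the data page headers (or the page index). The test must be conservative: a page may be skipped only when its statistics prove that no value in it can satisfy the filter:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical summary of one data page's header statistics.
struct PageStats {
  std::optional<int64_t> min;  // unset if min/max were not written
  std::optional<int64_t> max;
  int64_t null_count;
  int64_t num_values;
};

enum class FilterKind { kEquals, kIsNull, kIsNotNull };

// Conservative pruning test: returns false only when the statistics
// prove no row in the page can match, so the page may be skipped.
bool PageMayMatch(const PageStats& s, FilterKind kind,
                  int64_t equals_value = 0) {
  switch (kind) {
    case FilterKind::kEquals:
      // Without min/max we must assume the page may contain a match.
      if (!s.min || !s.max) return true;
      return equals_value >= *s.min && equals_value <= *s.max;
    case FilterKind::kIsNull:
      return s.null_count > 0;
    case FilterKind::kIsNotNull:
      return s.null_count < s.num_values;
  }
  return true;
}
```

Note that because min/max only bound the values, EQUALS pruning can produce false positives (the page overlaps the value but does not contain it); this is exactly why the "strong guarantee" discussed above still requires checking individual values after page pruning.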

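The row-range bookkeeping in the proposal can also be sketched briefly. `RowRange` and `GapsToSkip` are illustrative names only: once the predicate column has produced the ranges of rows that matched, the complementary gaps are what a reader of the remaining columns could hand to a SkipRecords-style API (the one being added in PARQUET-2188) instead of decoding those rows:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical half-open range [begin, end) of row indices that
// matched the filter on the predicate column.
struct RowRange {
  int64_t begin;
  int64_t end;
};

// Given sorted, non-overlapping matching ranges over [0, total_rows),
// compute the non-matching gaps that the other columns could skip.
std::vector<RowRange> GapsToSkip(const std::vector<RowRange>& matched,
                                 int64_t total_rows) {
  std::vector<RowRange> gaps;
  int64_t pos = 0;
  for (const RowRange& r : matched) {
    if (r.begin > pos) gaps.push_back({pos, r.begin});  // gap before r
    pos = r.end;
  }
  if (pos < total_rows) gaps.push_back({pos, total_rows});  // trailing gap
  return gaps;
}
```

For example, if rows [10, 20) and [50, 60) matched out of 100 rows, the other columns would skip [0, 10), [20, 50), and [60, 100).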