Hi Fatemah,

I think there are two things to consider here:

1. How will expressions be modeled? There are already some examples of using
expressions in Arrow for pruning predicates [1]. Do you plan to re-use them?

2. Along the same lines: is the proposed approach taken because an API that
exposes the raw statistics and lets callers filter externally would be too
unwieldy?
Thanks,
Micah

[1] https://github.com/apache/arrow/blob/5e49174d69deb9d1cbbdf82bc8041b90098f560b/cpp/src/arrow/dataset/file_parquet.cc

On Mon, Oct 31, 2022 at 9:50 AM Fatemah Panahi <[email protected]> wrote:

> -- Sending as an email in case Jira messages are filtered out. Please let
> me know your thoughts on this. Thanks!
>
> Jira ticket: https://issues.apache.org/jira/browse/PARQUET-2210
>
> Currently, we do not use the statistics that are stored in the page headers
> to prune the rows that we read. Row-group pruning is very coarse-grained
> and in many cases does not prune the row group at all. I propose adding a
> FilteredPageReader that would accept a filter and, based on page
> statistics, would not return the pages that cannot match the filter.
>
> The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
>
> The FilteredPageReader will also keep track of which row ranges matched and
> which did not. We could use this to skip reading the non-matching rows from
> the rest of the columns. Note that a SkipRecords API is being added to the
> Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
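To make the pruning decision in the quoted proposal concrete, here is a minimal
C++ sketch of how a page could be skipped from page-header statistics for the
three proposed filters (EQUALS, IS NULL, IS NOT NULL). All names here
(PageStats, PageFilter, CanSkipPage) are illustrative assumptions, not the
actual parquet-cpp API:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the min/max/null-count statistics that
// Parquet can store in a data page header. Typed int64 for brevity.
struct PageStats {
  std::optional<int64_t> min;  // absent if the writer omitted stats
  std::optional<int64_t> max;
  int64_t null_count = 0;
  int64_t num_values = 0;
};

enum class FilterOp { kEquals, kIsNull, kIsNotNull };

struct PageFilter {
  FilterOp op;
  int64_t value = 0;  // only meaningful for kEquals
};

// Returns true only when the statistics *prove* that no row in the
// page can satisfy the filter; otherwise the page must be read.
bool CanSkipPage(const PageStats& stats, const PageFilter& filter) {
  switch (filter.op) {
    case FilterOp::kEquals:
      // Without min/max we must conservatively keep the page.
      if (!stats.min || !stats.max) return false;
      return filter.value < *stats.min || filter.value > *stats.max;
    case FilterOp::kIsNull:
      // Skip when the page contains no nulls at all.
      return stats.null_count == 0;
    case FilterOp::kIsNotNull:
      // Skip when every value in the page is null.
      return stats.null_count == stats.num_values;
  }
  return false;
}
```

Note the asymmetry: statistics can only ever prove a page *irrelevant*; a page
that survives pruning may still contain no matching rows, which is where the
matched/unmatched row-range tracking and SkipRecords come in.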
