[ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fatemah updated PARQUET-2210:
-----------------------------
    Summary: Skip pages based on header metadata using a callback (was: Add 
FilteredPageReader to filter rows based on page statistics)

> Skip pages based on header metadata using a callback
> ----------------------------------------------------
>
>                 Key: PARQUET-2210
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2210
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: fatemah
>            Priority: Major
>
> Currently, we do not use the statistics that are stored in the page headers 
> to prune the rows that we read. Row group pruning is very coarse-grained 
> and in many cases does not prune the row group. I propose adding a 
> FilteredPageReader that accepts a filter and, based on page statistics, 
> does not return the pages that cannot match the filter.
> The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
> The FilteredPageReader will also keep track of which row ranges matched and 
> which did not. We could use this to skip reading non-matching rows from 
> the rest of the columns. Note that the SkipRecords API was recently added to 
> the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
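
The idea in the description can be illustrated with a minimal, self-contained sketch. It is not the Parquet API: PageHeader, the skip_unless_* callbacks, and matching_row_ranges are hypothetical names invented for this example, and the sketch assumes a flat (non-nested) column so that the value count of a page equals its row count. Each callback sees only header metadata and returns True when the page cannot contain a match; the row ranges of the pages that were kept are recorded so that other columns could skip the same rows.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class PageHeader:
    """Hypothetical stand-in for the statistics in a data page header."""
    num_values: int
    null_count: int
    min_value: Optional[int]  # None when no min/max statistics are present
    max_value: Optional[int]


# A skip callback returns True when the page cannot contain a match.
SkipCallback = Callable[[PageHeader], bool]


def skip_unless_equals(target: int) -> SkipCallback:
    """EQUALS: skip pages whose [min, max] range excludes the target."""
    def should_skip(header: PageHeader) -> bool:
        if header.null_count == header.num_values:
            return True  # an all-null page cannot equal a value
        if header.min_value is None or header.max_value is None:
            return False  # no statistics: the page must be read
        return not (header.min_value <= target <= header.max_value)
    return should_skip


def skip_unless_is_null(header: PageHeader) -> bool:
    """IS NULL: skip pages that contain no nulls."""
    return header.null_count == 0


def skip_unless_is_not_null(header: PageHeader) -> bool:
    """IS NOT NULL: skip pages that contain only nulls."""
    return header.null_count == header.num_values


def matching_row_ranges(
    headers: Iterable[PageHeader], should_skip: SkipCallback
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) row ranges of the pages that are kept.

    Adjacent kept pages are merged into a single range.
    """
    ranges: List[Tuple[int, int]] = []
    row = 0
    for header in headers:
        end = row + header.num_values
        if not should_skip(header):
            if ranges and ranges[-1][1] == row:
                ranges[-1] = (ranges[-1][0], end)  # extend previous range
            else:
                ranges.append((row, end))
        row = end
    return ranges


if __name__ == "__main__":
    pages = [
        PageHeader(100, 0, 1, 10),
        PageHeader(100, 100, None, None),
        PageHeader(50, 5, 20, 30),
        PageHeader(50, 0, 25, 40),
    ]
    print(matching_row_ranges(pages, skip_unless_equals(25)))
```

Because the callbacks work from statistics alone, a kept page may still contain non-matching rows; the ranges are a superset of the matching rows, so the filter must still be applied to the values that are read.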



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
