[
https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacques Nadeau updated DRILL-1950:
----------------------------------
Component/s: Storage - Parquet
Fix Version/s: Future
Assignee: Jacques Nadeau (was: Jason Altekruse)
> Implement filter pushdown for Parquet
> -------------------------------------
>
> Key: DRILL-1950
> URL: https://issues.apache.org/jira/browse/DRILL-1950
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Jason Altekruse
> Assignee: Jacques Nadeau
> Fix For: Future
>
>
> The parquet reader currently supports project pushdown, which limits the
> number of columns read, but it does not use filter pushdown to read only a
> subset of the values in the requested columns. Filter pushdown is particularly
> useful with parquet files that contain statistics, most importantly min and
> max values on pages. Evaluating predicates against these values could save
> significant reading and decoding time.
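
As a rough illustration of the statistics check described above, here is a minimal sketch of skipping pages whose min/max range cannot satisfy a range predicate. The `PageStats` class and the `mightMatch` and `pagesToRead` helpers are hypothetical names made up for this example; they are not part of the Parquet or Drill APIs, and real statistics would be read from the file metadata.

```java
import java.util.ArrayList;
import java.util.List;

final class PageStatsPruning {
    // Hypothetical stand-in for the min/max statistics stored for one page.
    static final class PageStats {
        final int pageIndex;
        final long min;
        final long max;

        PageStats(int pageIndex, long min, long max) {
            this.pageIndex = pageIndex;
            this.min = min;
            this.max = max;
        }
    }

    // For the predicate "lower <= col AND col <= upper", a page may contain
    // matches only if its [min, max] range overlaps [lower, upper].
    static boolean mightMatch(PageStats stats, long lower, long upper) {
        return stats.max >= lower && stats.min <= upper;
    }

    // Returns the indexes of the pages that still have to be read and decoded.
    static List<Integer> pagesToRead(List<PageStats> pages, long lower, long upper) {
        List<Integer> keep = new ArrayList<>();
        for (PageStats stats : pages) {
            if (mightMatch(stats, lower, upper)) {
                keep.add(stats.pageIndex);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<PageStats> pages = new ArrayList<>();
        pages.add(new PageStats(0, 1, 100));
        pages.add(new PageStats(1, 101, 200));
        pages.add(new PageStats(2, 201, 300));
        // Only pages 1 and 2 overlap the range [150, 250]; page 0 is skipped.
        System.out.println(pagesToRead(pages, 150, 250)); // prints [1, 2]
    }
}
```

Note that the check is conservative: a page that passes may still contain no matching value, so the predicate would still have to be applied to the decoded values.
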
> The largest barrier to implementing this is the current design of the reader.
> Firstly, we currently have two separate parquet readers, one for reading flat
> files very quickly and another for reading complex data. There are
> enhancements we can make to the flat reader so that it supports nested data
> in a much more efficient manner. However, the speed of the flat file reader
> currently comes from being able to make vectorized copies out of the parquet
> file. This design is somewhat at odds with filter pushdown, as we can only
> make useful vectorized copies if the filter matches a large run of values
> within the file. This might not be too rare a case, assuming files are often
> somewhat sorted on a primary field like date or a numeric key, and these are
> often the fields used to limit the query to a subset of the data. However, for
> cases where we are filtering out a few records here and there, we should just
> make individual copies.
> We need to do more design work on the best way to balance performance with
> these use cases in mind.
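
One possible way to balance the two cases described in the issue, sketched under stated assumptions: copy long runs of matching values in bulk and fall back to per-value copies for short runs. The `RunAwareCopy` class, the `copyMatching` method, and the `MIN_BULK_RUN` threshold are hypothetical and chosen only for illustration; this is not the Drill reader's code, and `System.arraycopy` merely stands in for a vectorized copy.

```java
import java.util.Arrays;

final class RunAwareCopy {
    // Hypothetical threshold: runs at least this long are copied in bulk.
    static final int MIN_BULK_RUN = 4;

    // Copies the values whose filter result is true, preserving their order.
    static long[] copyMatching(long[] src, boolean[] matches) {
        long[] dst = new long[src.length];
        int out = 0;
        int i = 0;
        while (i < src.length) {
            if (!matches[i]) {
                i++;
                continue;
            }
            // Find the end of the current run of consecutive matches.
            int start = i;
            while (i < src.length && matches[i]) {
                i++;
            }
            int runLength = i - start;
            if (runLength >= MIN_BULK_RUN) {
                // Long run: one bulk copy.
                System.arraycopy(src, start, dst, out, runLength);
                out += runLength;
            } else {
                // Short run: copy the values one at a time.
                for (int j = start; j < i; j++) {
                    dst[out++] = src[j];
                }
            }
        }
        return Arrays.copyOf(dst, out);
    }

    public static void main(String[] args) {
        long[] values = {10, 11, 12, 13, 14, 15, 16, 17};
        boolean[] matches = {true, false, true, true, true, true, true, false};
        // Index 0 is copied individually; indexes 2 through 6 are copied in bulk.
        System.out.println(Arrays.toString(copyMatching(values, matches)));
        // prints [10, 12, 13, 14, 15, 16]
    }
}
```

Where the threshold should sit would depend on measurements against real data, which is part of the design work the issue calls for.
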