Jason Altekruse created DRILL-1950:
--------------------------------------

             Summary: Implement filter pushdown for Parquet
                 Key: DRILL-1950
                 URL: https://issues.apache.org/jira/browse/DRILL-1950
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Jason Altekruse
            Assignee: Jason Altekruse


The parquet reader currently supports project pushdown, which limits the number 
of columns read. However, it does not use filter pushdown to read only a subset 
of the values in the requested columns. Filter pushdown is particularly useful 
with parquet files that contain statistics, most importantly min and max values 
on pages. Evaluating predicates against these values could let the reader skip 
pages that cannot match, saving significant reading and decoding time.
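
As a rough illustration of evaluating a predicate against page statistics, 
here is a minimal sketch in Java. It is not the Drill or parquet-mr API: the 
PageStats class, the method names, and the restriction to a single long column 
with a range predicate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class PageStatsPruning {

    /** Hypothetical holder for one page's min/max statistics on a long column. */
    static final class PageStats {
        final long min;
        final long max;

        PageStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns true if a page with these statistics could contain a value v
     * with lo <= v <= hi. A false result means the page can be skipped.
     */
    static boolean mightMatch(PageStats stats, long lo, long hi) {
        return stats.max >= lo && stats.min <= hi;
    }

    /** Returns the indexes of the pages that still have to be read and decoded. */
    static List<Integer> pagesToRead(List<PageStats> pages, long lo, long hi) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (mightMatch(pages.get(i), lo, hi)) {
                keep.add(i);
            }
        }
        return keep;
    }
}
```

For three pages covering [0, 99], [100, 199] and [200, 299], a predicate of 
150 <= value <= 210 keeps pages 1 and 2 and skips page 0 without decoding it. 
Pages that are kept still need the filter applied to their individual values.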

The largest barrier to implementing this is the current design of the reader. 
Firstly, we currently have two separate parquet readers, one for reading flat 
files very quickly and another for reading complex data. There are enhancements 
we can make to the flat reader so that it supports nested data in a much more 
efficient manner. However, the speed of the flat file reader currently comes 
from being able to make vectorized copies out of the parquet file. This design 
is somewhat at odds with filter pushdown, as we can only make useful 
vectorized copies if the filter matches a large run of values within the file. 
This might not be too rare a case, assuming files are often somewhat sorted on 
a primary field like date or a numeric key, and these are often the fields used 
to limit the query to a subset of the data. However, for cases where we are 
filtering out a few records here and there, we should just make individual 
copies.
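
One way to picture the trade-off between vectorized and individual copies is 
the sketch below. It is only an illustration under stated assumptions, not the 
reader's actual design: the int[] column, the boolean selection mask, the 
method name, and the MIN_BULK_RUN threshold are all hypothetical.

```java
import java.util.Arrays;

public class RunAwareCopy {

    /** Hypothetical threshold: runs at least this long are copied in bulk. */
    static final int MIN_BULK_RUN = 4;

    /**
     * Copies the values of src whose selected flag is true, preserving order.
     * Long runs of consecutive matches use one bulk copy; short runs fall
     * back to copying values one at a time.
     */
    static int[] copySelected(int[] src, boolean[] selected) {
        int[] dst = new int[src.length];
        int out = 0;
        int i = 0;
        while (i < src.length) {
            if (!selected[i]) {
                i++;
                continue;
            }
            // Find the end of the current run of matching values.
            int start = i;
            while (i < src.length && selected[i]) {
                i++;
            }
            int runLength = i - start;
            if (runLength >= MIN_BULK_RUN) {
                // Large run: a single bulk copy, as in the vectorized path.
                System.arraycopy(src, start, dst, out, runLength);
                out += runLength;
            } else {
                // Short run: individual copies.
                for (int j = start; j < i; j++) {
                    dst[out++] = src[j];
                }
            }
        }
        return Arrays.copyOf(dst, out);
    }
}
```

Both paths produce the same output, so the threshold only affects speed. Where 
it should sit, and whether a per-value mask is the right representation at 
all, is part of the design work that remains open.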

We need to do more design work on the best way to balance performance with 
these use cases in mind.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
