Yordan Pavlov created ARROW-11074:
-------------------------------------

             Summary: [Rust][DataFusion] Implement predicate push-down for 
parquet tables
                 Key: ARROW-11074
                 URL: https://issues.apache.org/jira/browse/ARROW-11074
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust - DataFusion
            Reporter: Yordan Pavlov
            Assignee: Yordan Pavlov


While profiling a DataFusion query I found that the code spends a lot of time reading data from parquet files. Predicate / filter push-down is a commonly used performance optimization, where statistics stored in parquet files (such as min / max values for the columns in each parquet row group) are evaluated against query filters to determine which row groups could contain data requested by the query. By pushing query filters all the way down to the parquet data source, entire row groups or even whole parquet files can be skipped, often resulting in significant performance improvements.
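
For reference, the statistics in question come straight from the parquet file metadata: each row group stores optional min / max values per column. A minimal sketch of reading them with the Rust parquet crate might look like the snippet below; the accessors shown (`has_min_max_set()`, `min()`, `max()`) belong to the older parquet API and newer crate versions expose `min_opt()` / `max_opt()` instead, so treat this as illustrative rather than as the code behind this issue:

```rust
// Illustrative only: dump the per-row-group min / max statistics that
// predicate push-down evaluates filters against. The file path and the
// use of Int64 statistics are assumptions made for the example.
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::statistics::Statistics;

fn print_row_group_stats(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let metadata = reader.metadata();

    for rg in 0..metadata.num_row_groups() {
        let row_group = metadata.row_group(rg);
        for col in 0..row_group.num_columns() {
            let column = row_group.column(col);
            // statistics are optional; writers may omit them
            if let Some(Statistics::Int64(s)) = column.statistics() {
                if s.has_min_max_set() {
                    println!(
                        "row group {}, column {}: min = {}, max = {}",
                        rg,
                        column.column_path().string(),
                        s.min(),
                        s.max()
                    );
                }
            }
        }
    }
    Ok(())
}
```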

 

I have been working on an implementation for a few weeks and the initial results look promising - with predicate push-down, DataFusion is now faster than Apache Spark (140ms for DataFusion vs 200ms for Spark) for the same query against the same parquet files. And I suspect that with the latest improvements to the filter kernel, DataFusion's performance will be even better.

 

My work is based on the following key ideas:
 * it's best to reuse the code for evaluating physical expressions that already exists in DataFusion
 * filter expressions pushed down to a parquet table are rewritten to use parquet statistics; for example `(column / 2) = 4` becomes `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this rewrite is done once for all files in a parquet table
 * for each parquet file, a RecordBatch containing all required statistics columns is produced, and the predicate expression from the previous step is evaluated against it, producing a boolean array which is finally used to filter the row groups in that parquet file (see the sketch after this list)
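
To make the last two ideas concrete, here is a minimal, self-contained sketch of the pruning step. The `c_min` / `c_max` statistics columns, the sample values, and the filter `(c / 2) = 4` are made up for the example, and the rewritten predicate is hand-evaluated in a closure; the actual implementation would evaluate the rewritten expression through DataFusion's physical expression code instead:

```rust
// Illustrative only: prune row groups for the filter `(c / 2) = 4` using a
// statistics RecordBatch with one row per row group. Column names and
// values are invented for the example.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // min / max of column `c` for three row groups of one parquet file
    let c_min: ArrayRef = Arc::new(Int64Array::from(vec![0, 10, 6]));
    let c_max: ArrayRef = Arc::new(Int64Array::from(vec![5, 20, 12]));

    let schema = Arc::new(Schema::new(vec![
        Field::new("c_min", DataType::Int64, false),
        Field::new("c_max", DataType::Int64, false),
    ]));
    let stats = RecordBatch::try_new(schema, vec![c_min, c_max])?;

    let mins = stats.column(0).as_any().downcast_ref::<Int64Array>().unwrap();
    let maxs = stats.column(1).as_any().downcast_ref::<Int64Array>().unwrap();

    // rewritten predicate: (c_min / 2) <= 4 && 4 <= (c_max / 2)
    // true => the row group may contain rows satisfying (c / 2) = 4
    let keep: Vec<bool> = (0..stats.num_rows())
        .map(|i| mins.value(i) / 2 <= 4 && 4 <= maxs.value(i) / 2)
        .collect();

    // only row group 2 (range [6, 12]) survives; row groups 0 and 1 are skipped
    let to_scan: Vec<usize> = keep
        .iter()
        .enumerate()
        .filter(|(_, k)| **k)
        .map(|(i, _)| i)
        .collect();
    println!("row groups to scan: {:?}", to_scan); // prints [2]
    Ok(())
}
```

The resulting boolean mask plays the role of the boolean array mentioned in the last bullet: only row groups whose statistics ranges could satisfy the filter are actually read.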

Next steps: integrate this work with the latest changes from the master branch, publish a WIP PR, and implement more unit tests.

[~andygrove], [~alamb], let me know what you think



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
