[ https://issues.apache.org/jira/browse/ARROW-11074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Lamb resolved ARROW-11074.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9064
[https://github.com/apache/arrow/pull/9064]

> [Rust][DataFusion] Implement predicate push-down for parquet tables
> -------------------------------------------------------------------
>
>                 Key: ARROW-11074
>                 URL: https://issues.apache.org/jira/browse/ARROW-11074
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Yordan Pavlov
>            Assignee: Yordan Pavlov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> While profiling a DataFusion query I found that the code spends a lot of time 
> reading data from parquet files. Predicate / filter push-down is a commonly 
> used performance optimization, where statistics stored in parquet files 
> (such as min / max values for the columns in each parquet row group) are 
> evaluated against query filters to determine which row groups could contain 
> data requested by a query. By pushing query filters all the way down to the 
> parquet data source, entire row groups or even whole parquet files can be 
> skipped, often resulting in significant performance improvements.
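>
> The core check is cheap. Below is a minimal Rust sketch of the idea (the 
> `RowGroupStats` type is hypothetical, not the actual parquet crate API): a 
> row group can only contain rows matching `column = v` if `min <= v <= max` 
> holds for that group's statistics, so any group failing the check is 
> skipped without being read.
>
>     /// Hypothetical per-row-group statistics; real code would read these
>     /// from the parquet file footer metadata.
>     struct RowGroupStats { min: i64, max: i64 }
>
>     /// A row group may contain a row with `column == value` only if the
>     /// value lies within the group's [min, max] range.
>     fn may_contain(stats: &RowGroupStats, value: i64) -> bool {
>         stats.min <= value && value <= stats.max
>     }
>
>     /// Indices of the row groups that still need to be read.
>     fn row_groups_to_read(groups: &[RowGroupStats], value: i64) -> Vec<usize> {
>         groups
>             .iter()
>             .enumerate()
>             .filter(|(_, s)| may_contain(s, value))
>             .map(|(i, _)| i)
>             .collect()
>     }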
>  
> I have been working on an implementation for a few weeks and initial results 
> look promising - with predicate push-down, DataFusion is now faster than 
> Apache Spark (140ms for DataFusion vs 200ms for Spark) for the same query 
> against the same parquet files. And I suspect that, with the latest 
> improvements to the filter kernel, DataFusion's performance will be even better.
>  
> My work is based on the following key ideas:
>  * it's best to reuse the existing code for evaluating physical expressions 
> already implemented in DataFusion
>  * filter expressions pushed down to a parquet table are rewritten in terms 
> of parquet statistics, for example `(column / 2) = 4` becomes 
> `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this is done once for 
> all files in a parquet table (see the first sketch below)
>  * for each parquet file, a RecordBatch containing all required statistics 
> columns is produced, and the predicate expression from the previous step is 
> evaluated against it, producing a boolean array which is finally used to 
> filter the row groups in each parquet file (see the second sketch below)
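>
> The rewrite step can be illustrated with a small self-contained Rust sketch 
> (a toy expression type, not DataFusion's actual Expr / PhysicalExpr 
> machinery). Note the sketch assumes the expression is monotonically 
> non-decreasing in the column, as in the `/ 2` example; a complete 
> implementation has to account for cases like division by a negative 
> constant, which swaps min and max.
>
>     #[derive(Clone)]
>     enum Expr {
>         Column(String),
>         Literal(i64),
>         Div(Box<Expr>, Box<Expr>),
>         Eq(Box<Expr>, Box<Expr>),
>         LtEq(Box<Expr>, Box<Expr>),
>         And(Box<Expr>, Box<Expr>),
>     }
>
>     fn bx(e: Expr) -> Box<Expr> { Box::new(e) }
>
>     /// Replace every column reference `c` with `c<suffix>`, e.g. `c_min`.
>     fn with_suffix(e: &Expr, s: &str) -> Expr {
>         use Expr::*;
>         match e {
>             Column(name) => Column(format!("{}{}", name, s)),
>             Literal(v) => Literal(*v),
>             Div(l, r) => Div(bx(with_suffix(l, s)), bx(with_suffix(r, s))),
>             Eq(l, r) => Eq(bx(with_suffix(l, s)), bx(with_suffix(r, s))),
>             LtEq(l, r) => LtEq(bx(with_suffix(l, s)), bx(with_suffix(r, s))),
>             And(l, r) => And(bx(with_suffix(l, s)), bx(with_suffix(r, s))),
>         }
>     }
>
>     /// Rewrite `lhs = rhs` into `lhs_min <= rhs && rhs <= lhs_max`, so
>     /// `(column / 2) = 4` becomes
>     /// `(column_min / 2) <= 4 && 4 <= (column_max / 2)`.
>     fn rewrite_for_stats(e: &Expr) -> Option<Expr> {
>         if let Expr::Eq(lhs, rhs) = e {
>             let lower = Expr::LtEq(bx(with_suffix(lhs, "_min")), rhs.clone());
>             let upper = Expr::LtEq(rhs.clone(), bx(with_suffix(lhs, "_max")));
>             Some(Expr::And(bx(lower), bx(upper)))
>         } else {
>             None
>         }
>     }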
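>
> Evaluating the rewritten predicate is then an ordinary vectorized 
> computation over the statistics columns, one row per row group. A 
> plain-Rust sketch of that step, using Vecs where the real implementation 
> uses a RecordBatch of Arrow arrays and produces a BooleanArray:
>
>     /// Min / max statistics for one column, with one row per row group.
>     struct StatsColumns {
>         column_min: Vec<i64>,
>         column_max: Vec<i64>,
>     }
>
>     /// Evaluate `(column_min / 2) <= 4 && 4 <= (column_max / 2)` for every
>     /// row group, producing a boolean mask that selects the row groups
>     /// worth reading.
>     fn keep_row_groups(stats: &StatsColumns) -> Vec<bool> {
>         stats
>             .column_min
>             .iter()
>             .zip(&stats.column_max)
>             .map(|(&min, &max)| (min / 2) <= 4 && 4 <= (max / 2))
>             .collect()
>     }
>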
> Next steps are to integrate this work with the latest changes from the 
> master branch, publish a WIP PR, and implement more unit tests.
> [~andygrove], [~alamb] let me know what you think.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
