[ https://issues.apache.org/jira/browse/ARROW-11074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Lamb resolved ARROW-11074.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 9064
[https://github.com/apache/arrow/pull/9064]
> [Rust][DataFusion] Implement predicate push-down for parquet tables
> -------------------------------------------------------------------
>
> Key: ARROW-11074
> URL: https://issues.apache.org/jira/browse/ARROW-11074
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Yordan Pavlov
> Assignee: Yordan Pavlov
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 8h 40m
> Remaining Estimate: 0h
>
> While profiling a DataFusion query I found that the code spends a lot of time
> reading data from parquet files. Predicate / filter push-down is a commonly
> used performance optimization, where statistics stored in parquet files (such
> as min / max values for the columns in a parquet row group) are evaluated
> against query filters to determine which row groups could contain data
> requested by a query. In this way, by pushing query filters all the way down
> to the parquet data source, entire row groups or even whole parquet files can
> be skipped, often resulting in significant performance improvements.
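>
> To make the skipping decision concrete, here is a minimal, self-contained
> sketch of the idea for a simple `column = value` filter. The `RowGroupStats`
> struct is hypothetical and only stands in for the min / max statistics that
> the parquet reader exposes per row group:
>
> ```rust
> // Hypothetical stand-in for the per-row-group statistics read from a
> // parquet file footer; not the real parquet crate API.
> struct RowGroupStats {
>     min: i64, // minimum value of the filter column in this row group
>     max: i64, // maximum value of the filter column in this row group
> }
>
> /// Keep only the row groups whose [min, max] range could contain rows
> /// matching the filter `column = value`; all others can be skipped.
> fn prune_row_groups(stats: &[RowGroupStats], value: i64) -> Vec<usize> {
>     stats
>         .iter()
>         .enumerate()
>         .filter(|(_, s)| s.min <= value && value <= s.max)
>         .map(|(idx, _)| idx)
>         .collect()
> }
>
> fn main() {
>     let stats = vec![
>         RowGroupStats { min: 0, max: 99 },    // cannot contain 150
>         RowGroupStats { min: 100, max: 199 }, // may contain 150
>         RowGroupStats { min: 200, max: 299 }, // cannot contain 150
>     ];
>     // Only row group 1 needs to be read from disk for `column = 150`.
>     assert_eq!(prune_row_groups(&stats, 150), vec![1]);
> }
> ```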
>
> I have been working on an implementation for a few weeks and initial results
> look promising - with predicate push-down, DataFusion is now faster than
> Apache Spark (140 ms for DataFusion vs. 200 ms for Spark) for the same query
> against the same parquet files. I suspect that with the latest improvements
> to the filter kernel, DataFusion's performance will be even better.
>
> My work is based on the following key ideas:
> * it's best to reuse the physical expression evaluation code already
> implemented in DataFusion
> * filter expressions pushed down to a parquet table are rewritten to use
> parquet statistics; for example, `(column / 2) = 4` becomes
> `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this rewrite is done once
> for all files in a parquet table (a sketch of this step follows the list)
> * for each parquet file, a RecordBatch containing all the required statistics
> columns is produced, and the predicate expression from the previous step is
> evaluated against it, producing a boolean array that is then used to filter
> the row groups in each parquet file
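>
> Here is a sketch of the rewrite idea from the second bullet, using a toy
> `Expr` type in place of DataFusion's physical expressions; all of the names
> below (`Expr`, `rename_columns`, `rewrite_eq`) are illustrative, not the
> actual DataFusion API:
>
> ```rust
> // Toy expression type standing in for DataFusion's physical expressions.
> #[derive(Clone, Debug)]
> enum Expr {
>     Column(String),             // reference to a table column
>     Literal(i64),               // integer literal
>     Div(Box<Expr>, Box<Expr>),  // left / right
>     LtEq(Box<Expr>, Box<Expr>), // left <= right
>     And(Box<Expr>, Box<Expr>),  // left && right
> }
>
> // Replace every column reference `c` with the statistics column
> // `c_min` or `c_max`, leaving the rest of the expression intact.
> fn rename_columns(expr: &Expr, suffix: &str) -> Expr {
>     match expr {
>         Expr::Column(name) => Expr::Column(format!("{}_{}", name, suffix)),
>         Expr::Literal(v) => Expr::Literal(*v),
>         Expr::Div(l, r) => Expr::Div(
>             Box::new(rename_columns(l, suffix)),
>             Box::new(rename_columns(r, suffix)),
>         ),
>         Expr::LtEq(l, r) => Expr::LtEq(
>             Box::new(rename_columns(l, suffix)),
>             Box::new(rename_columns(r, suffix)),
>         ),
>         Expr::And(l, r) => Expr::And(
>             Box::new(rename_columns(l, suffix)),
>             Box::new(rename_columns(r, suffix)),
>         ),
>     }
> }
>
> // Rewrite `lhs = rhs` into `lhs_min <= rhs && rhs <= lhs_max`: the
> // condition under which a row group could still contain matching rows.
> fn rewrite_eq(lhs: &Expr, rhs: &Expr) -> Expr {
>     Expr::And(
>         Box::new(Expr::LtEq(
>             Box::new(rename_columns(lhs, "min")),
>             Box::new(rhs.clone()),
>         )),
>         Box::new(Expr::LtEq(
>             Box::new(rhs.clone()),
>             Box::new(rename_columns(lhs, "max")),
>         )),
>     )
> }
>
> fn main() {
>     // (column / 2) = 4
>     let lhs = Expr::Div(
>         Box::new(Expr::Column("column".to_string())),
>         Box::new(Expr::Literal(2)),
>     );
>     // Prints the rewritten predicate, equivalent to
>     // (column_min / 2) <= 4 && 4 <= (column_max / 2)
>     println!("{:?}", rewrite_eq(&lhs, &Expr::Literal(4)));
> }
> ```
>
> In the actual implementation, this rewritten expression is then evaluated
> against the statistics RecordBatch described in the last bullet, yielding
> one boolean per row group.
>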
> Next steps are to integrate this work with the latest changes from the master
> branch, publish a WIP PR, and implement more unit tests.
> [~andygrove], [~alamb], let me know what you think
--
This message was sent by Atlassian Jira
(v8.3.4#803005)