alamb opened a new issue, #7456:
URL: https://github.com/apache/arrow-rs/issues/7456

   ## Is your feature request related to a problem or challenge? Please describe what you are trying to do.
   
   - Related to https://github.com/apache/datafusion/issues/3463 in DataFusion.
   
   When evaluating filters on data stored in parquet, you can:
   1. Use the [`with_row_filter`] API to apply predicates during the scan
   2. Read the data and apply the predicate with the [`filter`] kernel afterwards (both strategies are sketched below)
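   To make the comparison concrete, here is a minimal sketch of the two strategies using `ParquetRecordBatchReaderBuilder` (assuming the `parquet` crate's `arrow` feature). The column index and the placeholder `col0 > 10` predicate are illustrative assumptions, not part of this issue:
   
   ```rust
   use std::fs::File;
   
   use arrow::array::{Int64Array, RecordBatch};
   use arrow::compute::filter_record_batch;
   use arrow::compute::kernels::cmp::gt;
   use parquet::arrow::arrow_reader::{
       ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
   };
   use parquet::arrow::ProjectionMask;
   
   /// Strategy 1: push the predicate into the scan with `with_row_filter`
   fn scan_with_row_filter(file: File) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
       let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
       let schema = builder.metadata().file_metadata().schema_descr_ptr();
       let predicate = ArrowPredicateFn::new(
           // the predicate only needs to decode leaf column 0
           ProjectionMask::leaves(&schema, [0]),
           // placeholder predicate: `col0 > 10`
           |batch: RecordBatch| gt(batch.column(0), &Int64Array::new_scalar(10)),
       );
       let reader = builder
           .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
           .build()?;
       Ok(reader.collect::<Result<Vec<_>, _>>()?)
   }
   
   /// Strategy 2: decode the data first, then apply the same predicate with
   /// the `filter` kernel
   fn scan_then_filter(file: File) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
       let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
       let mut output = Vec::new();
       for batch in reader {
           let batch = batch?;
           let predicate = gt(batch.column(0), &Int64Array::new_scalar(10))?;
           output.push(filter_record_batch(&batch, &predicate)?);
       }
       Ok(output)
   }
   ```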
   
   Currently, it is faster to use [`with_row_filter`] for some predicates and [`filter`] for others. In DataFusion we have a configuration setting to choose between the strategies (`filter_pushdown`, see https://github.com/apache/datafusion/issues/3463), but that is poor UX: the user must somehow know which strategy to choose, and the best strategy changes depending on the query and data.
   
   In general, queries are slower when [`with_row_filter`] is used if:
   1. The predicates are not very selective (e.g. they pass more than 1% of the rows)
   2. The filters are applied to columns that are also used in the query result (e.g. a filter column is also in the projection)
   
   ### More Background:
   
   The predicates are provided as a [`RowFilter`] (see its docs for more details):
   
   > [RowFilter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html) applies predicates in order, after decoding only the columns required. As predicates eliminate rows, fewer rows from subsequent columns may be required, thus potentially reducing IO and decode.
   
   [`with_row_filter`]: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter
   [`filter`]: https://docs.rs/arrow/latest/arrow/compute/kernels/filter/index.html
   [`RowFilter`]: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
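   
   For reference, a sketch of what "applies predicates in order, after decoding only the columns required" looks like in code: each predicate gets its own `ProjectionMask`, and the `RecordBatch` passed to a predicate contains only the columns selected by that mask. The column indexes, types, and comparisons below are illustrative assumptions:
   
   ```rust
   use std::fs::File;
   
   use arrow::array::{Int64Array, RecordBatch, Scalar, StringArray};
   use arrow::compute::kernels::cmp::{eq, gt};
   use parquet::arrow::arrow_reader::{
       ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
   };
   use parquet::arrow::ProjectionMask;
   
   fn ordered_predicates(file: File) -> Result<(), Box<dyn std::error::Error>> {
       let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
       let schema = builder.metadata().file_metadata().schema_descr_ptr();
   
       // The first predicate decodes only leaf column 0 for every selected row
       let p1 = ArrowPredicateFn::new(
           ProjectionMask::leaves(&schema, [0]),
           |batch: RecordBatch| gt(batch.column(0), &Int64Array::new_scalar(10)),
       );
       // The second predicate decodes only leaf column 1, and only for rows
       // that passed the first predicate (its input batch has a single column)
       let p2 = ArrowPredicateFn::new(
           ProjectionMask::leaves(&schema, [1]),
           |batch: RecordBatch| eq(batch.column(0), &Scalar::new(StringArray::from(vec!["foo"]))),
       );
   
       let reader = builder
           .with_row_filter(RowFilter::new(vec![Box::new(p1), Box::new(p2)]))
           .build()?;
       for batch in reader {
           let _batch = batch?;
       }
       Ok(())
   }
   ```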
   
   ## Describe the solution you'd like
   
   I would like the evaluation of predicates in `RowFilter` (i.e. pushed-down predicates) to never be worse than decoding the columns first and then filtering them with the `filter` kernel.
   
   We have added a benchmark in https://github.com/apache/arrow-rs/pull/7401, which can hopefully be used to measure progress towards this goal:
   
   ```shell
   cargo bench --all-features --bench arrow_reader_row_filter
   ```
   
   ## Describe alternatives you've considered
   This goal will likely require several changes to the codebase. Here are some 
options:
   - [ ] https://github.com/apache/arrow-rs/pull/7401
   - [ ] https://github.com/apache/arrow-rs/issues/5523
   - [ ] https://github.com/apache/arrow-rs/issues/7363
   

