alamb commented on issue #7363: URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2792290735
Thanks @zhuqi-lucas! Is there any way you could look into creating a benchmark for evaluating filters? I can do so too if you prefer.

The idea is to create a benchmark for evaluating row filters (what @XiangpengHao is trying to optimize) that captures the common use case and is what we are trying to optimize in DataFusion. You add a row filter with this API: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter

I suggest a benchmark that:
1. Writes a parquet file with 100K rows and four columns (int64, float64, Utf8View, and Timestamp) into memory
2. Adds filters + projections
3. Benchmarks how fast it is to read the data back

For the filters, it is important to capture both selective filters (that select a small number of contiguous ranges) as well as non-selective filters (that select rows that are scattered throughout the data). Here are suggestions.

Filters:
1. A string filter like `col <> ''` that selects about 1/2 of the data
2. A string filter like `col = 'const'` that is selective and selects only a few rows
3. An integer filter like `col = value` (with both a selective and a non-selective variant)
4. A timestamp filter like `ts > time`

For the projections, it is important to capture both the case where the predicate column appears in the output and the case where only non-predicate columns appear. Here are suggestions.

Projections (which columns are selected out):
1. All 4 columns
2. Some column other than the filter column
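To make the shape of the benchmark concrete, here is a minimal sketch of the setup using the `with_row_filter` API linked above, along with `ArrowWriter`, `ParquetRecordBatchReaderBuilder`, `ArrowPredicateFn`, and `RowFilter` from the `parquet` crate (plus the `arrow` and `bytes` crates). It is not the benchmark itself: it uses only two columns for brevity, the column names `int_col` / `str_col` are illustrative placeholders, and a real benchmark would wrap the final read loop in criterion rather than running it once.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, Scalar, StringArray};
use arrow::compute::kernels::cmp::neq;
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::{ArrowWriter, ProjectionMask};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Write a parquet file with 100K rows into an in-memory buffer
    //    (two columns here for brevity; the benchmark would use four).
    let int_col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..100_000));
    let str_col: ArrayRef = Arc::new(StringArray::from_iter_values(
        (0..100_000).map(|i| if i % 2 == 0 { "" } else { "value" }),
    ));
    let batch =
        RecordBatch::try_from_iter([("int_col", int_col), ("str_col", str_col)])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    // 2. Add a filter (`str_col <> ''`, selecting about 1/2 of the rows)
    //    plus a projection that returns only the *other* column.
    let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?;

    // The predicate only needs to decode `str_col` (leaf column 1)
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [1]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        neq(
            batch.column(0),
            &Scalar::new(StringArray::from_iter_values([""])),
        )
    });

    // The output projects only `int_col` (leaf column 0), i.e. the
    // "some column other than the filter column" projection case
    let output_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .with_projection(output_mask)
        .build()?;

    // 3. This read loop is the part the benchmark would time
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    assert_eq!(rows, 50_000);
    Ok(())
}
```

The other filter/projection combinations suggested above would follow the same pattern, varying only the predicate closure and the two `ProjectionMask`s.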