[PR] Add `arrow_reader_clickbench` benchmark [arrow-rs]

via GitHub Mon, 05 May 2025 10:15:04 -0700


alamb opened a new pull request, #7470:
URL: https://github.com/apache/arrow-rs/pull/7470


   # Which issue does this PR close?
   
   
   - Closes https://github.com/apache/arrow-rs/issues/7460
   - Part of https://github.com/apache/arrow-rs/issues/7456
   
   # Rationale for this change
    
   We are trying to improve the performance of row filter application in the 
Parquet arrow reader, and part of that effort is a benchmark that we can use to 
guide optimization work. 
   
   However, as discussed in https://github.com/apache/arrow-rs/pull/7428 the 
`arrow_reader_row_filter` microbenchmark doesn't currently reflect the actual 
performance we see in our end to end application (DataFusion).
   
   ```shell
   cargo bench --all-features --bench arrow_reader_row_filter
   ```
   
   Thus, we think we need a benchmark that uses the actual ClickBench 
dataset with appropriate filtering.
   
   - See https://github.com/apache/arrow-rs/issues/7460 for more details
   
   # What changes are included in this PR?
   
   1. Adds a new `arrow_reader_clickbench` benchmark
   
   
   
   This benchmark tests applying the actual ClickBench filters (and column 
materialization) across:
   1. Single file and partitioned (100 file) datasets
   2. Async and sync readers
   3. All ClickBench query patterns
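
For readers unfamiliar with the pattern being measured, here is a minimal, standard-library-only sketch of "apply a filter, then materialize the projected columns". It is not the code in this PR and does not use the `parquet` crate API. `Column`, `evaluate_filter`, `materialize`, and `run_example` are hypothetical names, and the predicate is only modeled loosely on a ClickBench-style `AdvEngineID <> 0` filter with made-up data.

```rust
/// Toy in-memory column (a hypothetical stand-in for decoded Parquet data).
struct Column<T> {
    values: Vec<T>,
}

/// Step 1: evaluate the predicate on the filter column only,
/// producing one boolean per row (a selection mask).
fn evaluate_filter<T>(col: &Column<T>, predicate: impl Fn(&T) -> bool) -> Vec<bool> {
    col.values.iter().map(|v| predicate(v)).collect()
}

/// Step 2: materialize a projected column, keeping only the rows
/// that the mask selected.
fn materialize<T: Clone>(col: &Column<T>, mask: &[bool]) -> Vec<T> {
    col.values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.clone())
        .collect()
}

/// Filter on one column, project another (illustrative data only).
fn run_example() -> Vec<String> {
    let adv_engine_id = Column { values: vec![0i16, 3, 0, 7] };
    let search_phrase = Column {
        values: vec!["a", "b", "c", "d"]
            .into_iter()
            .map(String::from)
            .collect::<Vec<String>>(),
    };

    // Predicate in the style of `AdvEngineID <> 0`.
    let mask = evaluate_filter(&adv_engine_id, |v| *v != 0);

    // Only the selected rows of the projected column are copied out.
    materialize(&search_phrase, &mask)
}
```

Calling `run_example()` keeps only the rows whose filter column is non-zero. In a real reader, the cost of each step depends on decoding and I/O, which is why measuring against the actual dataset matters.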
   
   # Are there any user-facing changes?
   
   A new benchmark, which we hope will lead to improved filter / projection performance.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
