[PR] bench(reader): add ArrowReader benchmark harness [iceberg-rust]

via GitHub Sun, 31 May 2026 10:33:08 -0700


viirya opened a new pull request, #2558:
URL: https://github.com/apache/iceberg-rust/pull/2558


   ## Which issue does this PR close?
   
   - Closes #2557.
   
   ## What changes are included in this PR?
   
   Adds a criterion-based benchmark harness for the `ArrowReader` at 
`crates/iceberg/benches/arrow_reader.rs`. Until now there were no in-repo 
reader benchmarks, so every performance claim in the perf epic (#2172) had to 
be validated against external workloads. This harness writes Parquet files to a 
local temp dir and reads them back through the normal `FileIO` path, measuring 
the per-`FileScanTask` overhead that dominates scans of tables with many small 
files. Because it runs on the local FS, it isolates CPU / per-task work rather 
than network latency.
   
   Benchmark groups, chosen to map onto the epic's code paths:
   
   - **many_small_files** — scans 16/64/256 small files, reporting files/sec so 
per-file overhead is directly visible.
   - **concurrency** — a fixed corpus read at concurrency 1/4/16, exercising 
both the single-concurrency fast path and the buffered/flattened multi-task 
path.
   - **migrated_table** — files without embedded field IDs read via name 
mapping, isolating the migrated-table schema-resolution cost (#2176 path).
   - **same_file_splits** — one multi-row-group file read as 1/8/32 byte-range 
tasks, surfacing the redundant per-split metadata fetch that metadata caching 
(item #5, #2100) targets.
   - **with_predicate** — scans carrying a bound predicate with row-group 
filtering and row selection enabled, exercising the per-task row-filter setup.
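   
   As a rough illustration of the files/sec metric used by `many_small_files`, 
   the std-only sketch below writes N small files to a temp dir, times reading 
   them back, and reports files per second. It is not the harness from this PR: 
   it uses plain byte files and `std::fs` instead of Parquet, `FileIO`, and 
   criterion, and the names `files_per_sec` and `scan_small_files` are 
   hypothetical.
   
   ```rust
   use std::fs;
   use std::io;
   use std::path::Path;
   use std::time::{Duration, Instant};
   
   /// Files-per-second throughput for a scan of `n_files` that took `elapsed`.
   fn files_per_sec(n_files: usize, elapsed: Duration) -> f64 {
       n_files as f64 / elapsed.as_secs_f64()
   }
   
   /// Hypothetical stand-in for one benchmark iteration: write `n_files` small
   /// files under `dir`, then time reading them all back.
   /// Returns (total bytes read, elapsed read time). Only the read is timed.
   fn scan_small_files(dir: &Path, n_files: usize, file_size: usize) -> io::Result<(usize, Duration)> {
       fs::create_dir_all(dir)?;
       let payload = vec![0u8; file_size];
       for i in 0..n_files {
           fs::write(dir.join(format!("part-{i}.bin")), &payload)?;
       }
   
       let started = Instant::now();
       let mut total = 0;
       for i in 0..n_files {
           total += fs::read(dir.join(format!("part-{i}.bin")))?.len();
       }
       let elapsed = started.elapsed();
   
       // Clean up the temp corpus so repeated runs start from scratch.
       fs::remove_dir_all(dir)?;
       Ok((total, elapsed))
   }
   
   fn main() -> io::Result<()> {
       // Same file counts as the `many_small_files` group.
       for n in [16, 64, 256] {
           let dir = std::env::temp_dir()
               .join(format!("small-files-sketch-{}-{n}", std::process::id()));
           let (bytes, elapsed) = scan_small_files(&dir, n, 4096)?;
           println!("{n} files, {bytes} bytes, {:.0} files/sec", files_per_sec(n, elapsed));
       }
       Ok(())
   }
   ```
   
   Reporting files/sec rather than bytes/sec keeps the fixed per-file cost 
   visible: if the rate stays flat as the file count grows, the per-file 
   overhead is what bounds the scan.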
   
   This adds `criterion` (with the `async_tokio` feature) as a workspace 
dev-dependency and a `[[bench]]` entry to the `iceberg` crate.
   
   It gives a reproducible baseline for evaluating the remaining #2172 
optimizations such as operator caching (#2177) and metadata reuse. For example, 
`same_file_splits` shows reading one file as 32 byte-range tasks taking several 
times longer than reading it once, because each split re-fetches the Parquet 
metadata independently — exactly the cost item #5 targets.
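   
   To make the `same_file_splits` setup concrete, here is a minimal sketch of 
   dividing one file into N contiguous byte ranges, each of which would become 
   its own scan task. The helper `split_byte_ranges` is hypothetical and not 
   taken from this PR; the `(start, length)` pairs are assumed to resemble how a 
   scan task describes its byte range.
   
   ```rust
   /// Hypothetical helper: split `file_len` bytes into `n` contiguous
   /// `(start, length)` ranges that together cover the file exactly once.
   fn split_byte_ranges(file_len: u64, n: u64) -> Vec<(u64, u64)> {
       assert!(n > 0, "need at least one split");
       let base = file_len / n;
       let rem = file_len % n;
       let mut ranges = Vec::with_capacity(n as usize);
       let mut start = 0u64;
       for i in 0..n {
           // The first `rem` ranges take one extra byte so nothing is dropped.
           let len = base + if i < rem { 1 } else { 0 };
           ranges.push((start, len));
           start += len;
       }
       ranges
   }
   
   fn main() {
       // Same split counts as the `same_file_splits` group.
       for n in [1u64, 8, 32] {
           let ranges = split_byte_ranges(1_000_003, n);
           let covered: u64 = ranges.iter().map(|&(_, len)| len).sum();
           println!("{n} splits, {covered} bytes covered, first = {:?}", ranges[0]);
       }
   }
   ```
   
   Every range points at the same file, so a reader that loads the Parquet 
   footer once per task repeats that work N times. That repeated work is the 
   redundancy this benchmark group is meant to expose.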
   
   Run with:
   
   ```
   cargo bench -p iceberg --bench arrow_reader
   ```
   
   ## Are these changes tested?
   
   This PR adds only a benchmark harness; no library code changes. The harness 
compiles and runs (`cargo bench -p iceberg --bench arrow_reader`) and reuses 
the same Parquet-writing and reader-driving patterns as the existing reader 
unit tests. `cargo fmt` and `cargo clippy -p iceberg --benches` are clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


