viirya opened a new pull request, #2558: URL: https://github.com/apache/iceberg-rust/pull/2558
## Which issue does this PR close?

- Closes #2557.

## What changes are included in this PR?

Adds a criterion-based benchmark harness for the `ArrowReader` at `crates/iceberg/benches/arrow_reader.rs`. Until now there were no in-repo reader benchmarks, so every performance claim in the perf epic (#2172) had to be validated against external workloads.

This harness writes Parquet files to a local temp dir and reads them back through the normal `FileIO` path, measuring the per-`FileScanTask` overhead that dominates scans of tables with many small files. Because it runs on the local FS, it isolates CPU / per-task work rather than network latency.

Benchmark groups, chosen to map onto the epic's code paths:

- **many_small_files**: scans 16/64/256 small files, reporting files/sec so per-file overhead is directly visible.
- **concurrency**: a fixed corpus read at concurrency 1/4/16, exercising both the single-concurrency fast path and the buffered/flattened multi-task path.
- **migrated_table**: files without embedded field IDs read via name mapping, isolating the migrated-table schema-resolution cost (#2176 path).
- **same_file_splits**: one multi-row-group file read as 1/8/32 byte-range tasks, surfacing the redundant per-split metadata fetch that metadata caching (item #5, #2100) targets.
- **with_predicate**: scans carrying a bound predicate with row-group filtering and row selection enabled, exercising the per-task row-filter setup.

This adds `criterion` (with the `async_tokio` feature) as a workspace dev-dependency and a `[[bench]]` entry to the `iceberg` crate.

It gives a reproducible baseline for evaluating the remaining #2172 optimizations such as operator caching (#2177) and metadata reuse. For example, `same_file_splits` shows reading one file as 32 byte-range tasks taking several times longer than reading it once, because each split re-fetches the Parquet metadata independently. That is exactly the cost item #5 targets.
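To make the `same_file_splits` observation concrete, here is a minimal std-only sketch of why splitting one file into byte-range tasks multiplies metadata fetches. It is a toy cost model, not the code in this PR or in the `iceberg` crate: `SplitTask`, `split_tasks`, `fetches_without_cache`, and `fetches_with_cache` are hypothetical names, and "one fetch per task" is an assumption taken from the description above.

```rust
use std::collections::HashSet;
use std::ops::Range;

/// Hypothetical stand-in for a per-split scan task (not the real `FileScanTask`).
struct SplitTask {
    path: String,
    byte_range: Range<u64>,
}

/// Splits a file of `file_len` bytes into `n` contiguous byte-range tasks.
fn split_tasks(path: &str, file_len: u64, n: u64) -> Vec<SplitTask> {
    assert!(n > 0, "need at least one split");
    (0..n)
        .map(|i| SplitTask {
            path: path.to_string(),
            byte_range: (file_len * i / n)..(file_len * (i + 1) / n),
        })
        .collect()
}

/// Assumed uncached behaviour: every task fetches the file's metadata itself.
fn fetches_without_cache(tasks: &[SplitTask]) -> usize {
    tasks.len()
}

/// Assumed cached behaviour: metadata is fetched once per distinct file path.
fn fetches_with_cache(tasks: &[SplitTask]) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    // `insert` returns true only the first time a path is seen.
    tasks
        .iter()
        .filter(|task| seen.insert(task.path.as_str()))
        .count()
}
```

Under this model, 32 splits of a single file cost 32 metadata fetches without a cache and 1 with a per-path cache, which is the gap the `same_file_splits` group is meant to expose.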
Run with:

```
cargo bench -p iceberg --bench arrow_reader
```

## Are these changes tested?

This is a benchmark, not a code change. The harness compiles and runs (`cargo bench -p iceberg --bench arrow_reader`), and reuses the same Parquet-writing and reader-driving patterns as the existing reader unit tests. `cargo fmt` and `cargo clippy -p iceberg --benches` are clean.
