viirya opened a new issue, #2557: URL: https://github.com/apache/iceberg-rust/issues/2557
Is your feature request related to a problem or challenge?

The `ArrowReader` has been the subject of a dedicated performance epic (#2172), and several optimizations from it have already landed (file-size passthrough #2175, single metadata load for migrated tables #2176, metadata size hint #2173, range coalescing #2181), with more proposed (operator caching #2177; same-file metadata caching #2100, now closed).

The problem is that **there is currently no benchmark for the reader anywhere in the repo**: no `benches/`, no criterion harness. Every performance claim in #2172 was measured against an external DataFusion Comet workload. That makes it hard for contributors and reviewers to:

- reproduce the per-`FileScanTask` overhead the epic describes,
- evaluate whether a proposed optimization actually helps, and on which scenario,
- guard against regressions.

This gap had a concrete cost: #2100 (same-file metadata caching) was closed partly because the author could not demonstrate a benefit on their particular workload (a table with a ~1:1 task-to-file ratio, where same-file caching has nothing to hit). With a reproducible same-file-split benchmark in the repo, that kind of optimization could be evaluated objectively.

Describe the solution you'd like

A criterion-based benchmark harness (`crates/iceberg/benches/arrow_reader.rs`) that writes Parquet files to a local temp dir and reads them back through the normal `FileIO` path, measuring per-task overhead rather than network latency.

Proposed scenarios, chosen to map onto the epic's code paths:

- **many_small_files**: scans of 16/64/256 small files; per-file overhead in files/sec.
- **concurrency**: a fixed corpus at concurrency 1/4/16 (single-concurrency fast path vs buffered/flattened path).
- **migrated_table**: files without embedded field IDs, read via name mapping (the #2176 path).
- **same_file_splits**: one multi-row-group file read as 1/8/32 byte-range tasks (the #2100 / item-5 path).
- **with_predicate**: scans with a bound predicate, with row-group filtering and row selection enabled.

These run on the local FS, so they isolate CPU and per-task work. They are not a substitute for object-store latency benchmarks, but they give a reproducible baseline that any of the remaining #2172 optimizations can be measured against.

Willingness to contribute

I have a branch ready and will open a PR.
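
As an illustration of the same_file_splits scenario, the sketch below shows one way a benchmark setup could carve a single file into 1/8/32 contiguous byte-range tasks. It is a std-only sketch, not code from the proposed harness: `ByteRange`, `split_byte_ranges`, and the file length are hypothetical names and values chosen for this example, and how the reader maps each range onto row groups is outside its scope.

```rust
/// Hypothetical stand-in for the byte range a scan task would carry.
/// The field names are assumptions made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    length: u64,
}

/// Split `file_len` bytes into `n` contiguous, non-overlapping ranges.
/// The first `file_len % n` ranges are one byte longer, so the ranges
/// cover the file exactly with no gap and no overlap.
fn split_byte_ranges(file_len: u64, n: u64) -> Vec<ByteRange> {
    assert!(n > 0, "need at least one task");
    let base = file_len / n;
    let extra = file_len % n;
    let mut ranges = Vec::with_capacity(n as usize);
    let mut start = 0u64;
    for i in 0..n {
        // Spread the remainder over the leading ranges.
        let length = base + u64::from(i < extra);
        ranges.push(ByteRange { start, length });
        start += length;
    }
    ranges
}

fn main() {
    // Arbitrary file length for illustration; a real benchmark would use
    // the size of the Parquet file it just wrote.
    let file_len = 1_000_003;
    for n in [1, 8, 32] {
        let ranges = split_byte_ranges(file_len, n);
        let covered: u64 = ranges.iter().map(|r| r.length).sum();
        println!("{n} tasks, first = {:?}, covered = {covered}", ranges[0]);
    }
}
```

Keeping the split deterministic means each task count reads the same bytes in total, so any timing difference between 1, 8, and 32 tasks reflects per-task overhead rather than a change in the amount of data read.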
