viirya opened a new issue, #2557:
URL: https://github.com/apache/iceberg-rust/issues/2557

   Is your feature request related to a problem or challenge?
   
   The `ArrowReader` has been the subject of a dedicated performance epic 
(#2172), and several optimizations have already landed from it (file-size 
passthrough #2175, single metadata load for migrated tables #2176, metadata 
size hint #2173, range coalescing #2181), with more proposed (operator caching 
#2177, and same-file metadata caching #2100, which was closed).
   
   The problem is that **there is currently no benchmark for the reader 
anywhere in the repo** — no `benches/`, no criterion harness. Every performance 
claim in #2172 was measured against an external DataFusion Comet workload. That 
makes it hard for contributors and reviewers to:
   
   - reproduce the per-`FileScanTask` overhead the epic describes,
   - evaluate whether a proposed optimization actually helps, and on which 
scenario,
   - guard against regressions.
   
   This gap had a concrete cost: #2100 (same-file metadata caching) was closed 
partly because the author could not demonstrate a benefit on their particular 
workload (a table with a ~1:1 task-to-file ratio, where same-file caching has 
nothing to hit). With a reproducible same-file-split benchmark in the repo, 
that kind of optimization could be evaluated objectively.
   
   Describe the solution you'd like
   
   A criterion-based benchmark harness 
(`crates/iceberg/benches/arrow_reader.rs`) that writes Parquet files to a local 
temp dir and reads them back through the normal `FileIO` path, measuring 
per-task overhead rather than network latency. Proposed scenarios, chosen to 
map onto the epic's code paths:
   
   - **many_small_files** — scans of 16/64/256 small files; per-file overhead 
in files/sec.
   - **concurrency** — a fixed corpus at concurrency 1/4/16 (single-concurrency 
fast path vs buffered/flattened path).
   - **migrated_table** — files without embedded field IDs, read via name 
mapping (the #2176 path).
   - **same_file_splits** — one multi-row-group file read as 1/8/32 byte-range 
tasks (the #2100 / item-5 path).
   - **with_predicate** — scans with a bound predicate, row-group filtering and 
row selection enabled.
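
To make the **same_file_splits** scenario concrete, here is a minimal sketch of how one file could be divided into 1/8/32 byte-range tasks. `split_byte_ranges` is a hypothetical helper, not something in the repo or the branch; it assumes each task is described by a start offset and a length, and that the ranges should be contiguous and cover the whole file.

```rust
/// Hypothetical helper (not part of iceberg-rust): split a file of
/// `file_len` bytes into `n` contiguous byte ranges, returned as
/// (start, length) pairs that together cover the whole file.
pub fn split_byte_ranges(file_len: u64, n: u64) -> Vec<(u64, u64)> {
    assert!(n > 0, "need at least one task");
    let base = file_len / n; // minimum length of every range
    let rem = file_len % n; // leftover bytes to distribute
    let mut ranges = Vec::with_capacity(n as usize);
    let mut start = 0u64;
    for i in 0..n {
        // The first `rem` ranges absorb one extra byte each.
        let len = base + if i < rem { 1 } else { 0 };
        ranges.push((start, len));
        start += len;
    }
    ranges
}
```

With `n = 1` this yields a single task spanning the file, which gives a baseline to compare against the 8-task and 32-task cases that hit the same file repeatedly.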
   
   These run on the local FS, so they isolate CPU and per-task work. They are 
not a substitute for object-store latency benchmarks, but they give a 
reproducible baseline that any of the remaining #2172 optimizations can be 
measured against.
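
The shape of a local-FS measurement can be sketched with the standard library alone. This is only a stand-in under stated assumptions: it writes plain byte files rather than Parquet, reads them with `std::fs` rather than `FileIO`, and times a single pass with `Instant` rather than using criterion. The function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Hypothetical: write `n` files of `bytes` zero bytes each into `dir`.
pub fn write_corpus(dir: &Path, n: usize, bytes: usize) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let mut paths = Vec::with_capacity(n);
    for i in 0..n {
        let path = dir.join(format!("part-{i:05}.bin"));
        fs::write(&path, vec![0u8; bytes])?;
        paths.push(path);
    }
    Ok(paths)
}

/// Hypothetical: read every file once, returning total bytes read and
/// the wall-clock time the pass took.
pub fn time_scan(paths: &[PathBuf]) -> io::Result<(u64, Duration)> {
    let started = Instant::now();
    let mut total = 0u64;
    for p in paths {
        total += fs::read(p)?.len() as u64;
    }
    Ok((total, started.elapsed()))
}

/// Throughput in files per second, the unit proposed for many_small_files.
pub fn files_per_sec(n: usize, elapsed: Duration) -> f64 {
    n as f64 / elapsed.as_secs_f64()
}
```

A real harness would repeat the scan many times and report a distribution, which is what criterion provides; a single pass like this is dominated by noise and page-cache effects.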
   
   Willingness to contribute
   
   I have a branch ready and will open a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

