A similar issue was flagged earlier on the mailing list, and a solution was voted to be worked on [1]. Please go through the findings in the quoted message and share your thoughts. Let me know if anything should be added to them.
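
To make the request pattern concrete, here is a rough sketch of the footer read with a size hint. It is not the code in the PR and not parquet-mr or arrow-rs API: RangeReader, InMemoryObject, readFooter and fakeFile are hypothetical names, and an in-memory byte array stands in for S3 so that each read() call models one GET. It assumes an unencrypted file whose last 8 bytes are the 4-byte little-endian footer length followed by the "PAR1" magic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch only: RangeReader and InMemoryObject are hypothetical stand-ins for an S3 client. */
final class FooterHintSketch {
  static final int TAIL_LEN = 8; // 4-byte little-endian footer length + "PAR1" magic

  /** Hypothetical ranged-read interface; each read() call models one S3 GET. */
  interface RangeReader {
    long length();
    byte[] read(long offset, int len);
  }

  /** In-memory object that counts how many ranged reads were issued. */
  static final class InMemoryObject implements RangeReader {
    final byte[] data;
    int gets = 0;
    InMemoryObject(byte[] data) { this.data = data; }
    public long length() { return data.length; }
    public byte[] read(long offset, int len) {
      gets++;
      return Arrays.copyOfRange(data, (int) offset, (int) offset + len);
    }
  }

  /** Returns the serialized footer bytes, fetching `hint` bytes from the end of the file first. */
  static byte[] readFooter(RangeReader in, int hint) {
    long fileLen = in.length();
    int suffixLen = (int) Math.min(fileLen, Math.max(hint, TAIL_LEN));
    byte[] suffix = in.read(fileLen - suffixLen, suffixLen);

    // The footer length sits in the 4 bytes just before the trailing magic.
    int footerLen = ByteBuffer.wrap(suffix, suffixLen - TAIL_LEN, 4)
        .order(ByteOrder.LITTLE_ENDIAN).getInt();
    String magic = new String(suffix, suffixLen - 4, 4, StandardCharsets.US_ASCII);
    if (!magic.equals("PAR1")) throw new IllegalStateException("not a Parquet file");

    if (footerLen + TAIL_LEN <= suffixLen) {
      // The hint already covered the footer: no second GET is needed.
      int start = suffixLen - TAIL_LEN - footerLen;
      return Arrays.copyOfRange(suffix, start, start + footerLen);
    }
    // The hint was too small: one more GET for exactly the footer bytes.
    return in.read(fileLen - TAIL_LEN - footerLen, footerLen);
  }

  /** Builds a fake file: leading magic, zeroed body, footer, footer length, trailing magic. */
  static byte[] fakeFile(int bodyLen, byte[] footer) {
    ByteBuffer buf = ByteBuffer.allocate(4 + bodyLen + footer.length + TAIL_LEN)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put("PAR1".getBytes(StandardCharsets.US_ASCII));
    buf.position(4 + bodyLen);
    buf.put(footer).putInt(footer.length).put("PAR1".getBytes(StandardCharsets.US_ASCII));
    return buf.array();
  }
}
```

With a hint of 8 bytes this behaves like the current two-step footer read (2 GETs). With a hint that covers the footer it needs 1 GET. Passing the file length as the hint is the small-file case from the proposal: the same single GET also returns the row group bytes, so no further request would be needed.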

I would appreciate anyone going through this and sharing their views. I would be happy to explore any specific direction and understand alternatives.

[1] https://lists.apache.org/thread/jz4534dodn0zhwdg2wojwkkohg2dps9p

On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]> wrote:

> Hello everyone,
>
> I would like to discuss an optimization for Iceberg's Parquet read path,
> specifically around reducing S3 GET requests for small file workloads -
> Root Manifest, Datafiles, and small file compaction.
>
> *Problem*
> The current Iceberg flow for Spark readers uses parquet-mr. For each
> FileScanTask, it issues 3 GET requests:
>
> 1. Footer size discovery - 1 GET reads the last 8 bytes of the Parquet
>    file to find the actual footer size (this.currentIterator =
>    open(currentTask) in BaseReader.next)
> 2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>    open(currentTask) in BaseReader.next)
> 3. Row group fetch - 1 GET per row group to fetch actual data
>    (this.current = currentIterator.next() in BaseReader.next)
>
> *Background* - arrow-rs (Parquet Rust implementation)
> arrow-rs already addresses the first two calls via
> `with_footer_size_hint`. It fetches a hint-sized range from the end of the
> file, which contains the actual footer size. If the footer already falls
> within that fetched range, 1 GET is eliminated. If not, a second GET
> fetches the footer. DataFusion builds on this today.
> For our use case, we can go further: since the files are small, instead of
> a hint we can fetch the whole file at once in a single GET - no memory
> concern in parquet-mr - replacing all 3 calls with one.
> As the number of files grows, footer request time starts dominating over
> actual data request time - clearly visible in the benchmarks below.
>
> *Two Approaches*
>
> 1. Implement directly in Iceberg - I have a high-level PR for this
>    implementation - a complete workaround in the Iceberg codebase.
>    (https://github.com/apache/iceberg/pull/16729)
> 2. Fix upstream in parquet-mr - The architecturally correct path: add
>    this functionality to parquet-mr itself and use it entirely, mirroring
>    what the Rust implementation does natively.
>
> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
> iterations)
> Combining S3 GET requests alone gives a 60-65% improvement, with further
> gains possible by parallelising them.
>
> [image: image.png]
>
> As focus shifts towards Root Manifest, Datafiles in Parquet, and multiple
> small file requirements, a dedicated effort here seems worth pursuing.
> I would be happy to hear any thoughts on this. Points to discuss are which
> approach seems more convincing - the Iceberg implementation or the upstream
> parquet-mr implementation - and further thoughts on the gaps between
> parquet-mr and arrow-rs, specifically around fetching the footer.
>
> [1] PR for high level implementation -
> https://github.com/apache/iceberg/pull/16729
>
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
