I will be happy to hear further thoughts on this.

Thanks

On Wed, Jun 24, 2026 at 3:37 AM Varun Lakhyani <[email protected]> wrote:
> I have a PR [1] which doesn't affect current encryption, metrics, or
> anything else.
> It just fetches the whole file as a byte array and lets Parquet (or any
> other format) read from memory rather than from cloud storage; that is the
> only change here.
>
> Also, I will benchmark with the S3 accelerator enabled and will try to
> understand it further.
> That said, for small files the approaches are complementary - the
> accelerator does predictive prefetching, which is valuable for large files,
> but for small files below a threshold a single whole-file fetch eliminates
> all prediction overhead entirely, with bounded and predictable memory usage
> (capped at the threshold).
>
> The implementation is not tied to Parquet or S3 - EagerInputFile wraps any
> InputFile and works with any format (I haven't tested this, but it should
> work fine).
> I benchmarked only Parquet + S3 (60-65% improvement), so I am not fully
> sure, but the same benefit should be present for ADLS and GCS.
>
> [1] https://github.com/apache/iceberg/pull/16729
>
> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I would like to discuss an optimization for Iceberg's Parquet read path,
>> specifically around reducing S3 GET requests for small file workloads -
>> Root Manifest, data files, and small file compaction.
>>
>> *Problem*
>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>> FileScanTask, it issues at least 3 GET requests:
>>
>>    1. Footer size discovery - 1 GET reads the last 8 bytes of the
>>    Parquet file to find the actual footer size (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    3. Row group fetch - 1 GET per row group to fetch actual data
>>    (this.current = currentIterator.next() in BaseReader.next)
>>
>>
>> *Background* - arrow-rs (Parquet Rust implementation)
>> arrow-rs already addresses the first two calls via
>> `with_footer_size_hint`. It fetches a hint-sized range from the end of the
>> file, which contains the actual footer size - if the footer already falls
>> within that fetched range, 1 GET is eliminated. If not, a second GET
>> fetches the footer. DataFusion builds on this today.
>> For our use case, we can go further: since the files are small, instead
>> of a hint we can fetch the whole file at once in a single GET - no memory
>> concern in parquet-mr - replacing all 3 calls with one.
>> As the number of files grows, footer request time starts dominating over
>> actual data request time - clearly visible in the benchmarks below.
>>
>> *Two Approaches*
>>
>>    1. Implement directly in Iceberg - I have a high-level PR for this
>>    implementation - a complete workaround in the Iceberg codebase. (
>>    https://github.com/apache/iceberg/pull/16729)
>>    2. Fix upstream in parquet-mr - The architecturally correct path: add
>>    this functionality to parquet-mr itself and use it entirely, mirroring
>>    what the Rust implementation does natively.
>>
>>
>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>> iterations)
>> Combining S3 GET requests alone gives a 60-65% improvement, with further
>> gains possible by parallelising them.
>>
>> [image: image.png]
>>
>>
>> As focus shifts towards Root Manifest, data files in Parquet, and multiple
>> small file requirements, a dedicated effort here seems worth pursuing.
>> I would be happy to hear any thoughts on this. Points to discuss are
>> which approach seems more convincing - the Iceberg implementation or the
>> upstream parquet-mr implementation - and further thoughts on the gaps
>> between parquet-mr and arrow-rs, specifically around fetching the footer.
>>
>> [1] PR for high level implementation -
>> https://github.com/apache/iceberg/pull/16729
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
>> Contact: +91 96246 46174
>>
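
For anyone who wants to see the request counts from the thread side by side, here is a minimal sketch. It is written in Python only for brevity (Iceberg and parquet-mr are Java), and it is not the code from the PR or the arrow-rs API. CountingStore, the toy footer encoding, and all function names are hypothetical; the only part that mirrors the real Parquet layout is the file tail (footer, then a 4-byte little-endian footer length, then the magic bytes PAR1). The file length is assumed to be known up front (for example from table metadata), so it costs no request.

```python
import struct

MAGIC = b"PAR1"  # Parquet magic bytes at the start and end of a file


class CountingStore:
    """Hypothetical in-memory stand-in for an object store; counts ranged GETs."""

    def __init__(self, data):
        self.data = data
        self.gets = 0

    def size(self):
        # Assumption: length is already known, so no request is counted.
        return len(self.data)

    def get_range(self, start, end):
        self.gets += 1
        return self.data[start:end]


def build_file(row_groups):
    """Toy file: magic, row groups, toy footer, footer length, magic."""
    ranges, pos = [], len(MAGIC)
    for rg in row_groups:
        ranges.append((pos, pos + len(rg)))
        pos += len(rg)
    # Toy footer: (start, end) offsets per row group, NOT real Thrift metadata.
    footer = b"".join(struct.pack("<II", s, e) for s, e in ranges)
    return MAGIC + b"".join(row_groups) + footer + struct.pack("<I", len(footer)) + MAGIC


def footer_len(tail8):
    """Decode the footer length from the last 8 bytes of the file."""
    if tail8[4:] != MAGIC:
        raise ValueError("not a Parquet-style tail")
    return struct.unpack("<I", tail8[:4])[0]


def parse_footer(footer):
    return [struct.unpack_from("<II", footer, i) for i in range(0, len(footer), 8)]


def read_three_step(store):
    """Tail GET, footer GET, then one GET per row group."""
    size = store.size()
    n = footer_len(store.get_range(size - 8, size))
    footer = store.get_range(size - 8 - n, size - 8)
    return [store.get_range(s, e) for s, e in parse_footer(footer)]


def read_with_hint(store, hint):
    """Fetch the last `hint` bytes (hint >= 8); refetch only if the footer is larger."""
    size = store.size()
    start = max(0, size - hint)
    tail = store.get_range(start, size)
    n = footer_len(tail[-8:])
    footer_start = size - 8 - n
    if footer_start >= start:
        footer = tail[footer_start - start:len(tail) - 8]
    else:
        footer = store.get_range(footer_start, size - 8)
    return [store.get_range(s, e) for s, e in parse_footer(footer)]


def read_eager(store, threshold):
    """Below the threshold, one whole-file GET; otherwise fall back to three-step."""
    size = store.size()
    if size > threshold:
        return read_three_step(store)
    data = store.get_range(0, size)
    n = footer_len(data[-8:])
    footer = data[size - 8 - n:size - 8]
    return [data[s:e] for s, e in parse_footer(footer)]
```

With two row groups, the three-step path costs 4 GETs, a large enough hint costs 3, and the eager path costs 1; above the threshold the eager path simply behaves like the three-step one, which keeps memory bounded by the threshold.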
