Re: [DISCUSS] Combine 3 GET calls for parquet reads - Root Manifest, Datafiles and compaction of small files

Steve Loughran Thu, 25 Jun 2026 04:08:15 -0700

commented on the PR.

You should benchmark against the AWS accelerator: it is likely to show less
dramatic speedups, which makes for a more honest comparison.


If you want to seriously measure the cost of S3 HEAD/GET requests in
benchmarks:

   1. turn on s3 bucket logging to collect logs for requests
   2. set the user agent on your test processes to be unique
   3. grab the logs and count the requests after

There is a tool that takes the AWS logs and converts them to Avro records,
which you can then pull into Spark:
https://github.com/apache/hadoop-cloudstore/blob/main/src/site/markdown/auditlogs.md

doing that as a before/after of any change assesses the real savings of the
work, independent of execution time.
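
A minimal sketch of step 3 above, assuming the logs are in the standard S3 server access log text format (space-separated fields, with the timestamp in square brackets and the request line, referrer, and user agent in double quotes). The function name `count_requests`, the field positions, and the sample user agent marker are illustrative assumptions; check them against your own logs before trusting the counts.

```python
import re
from collections import Counter

# One token is a [bracketed] timestamp, a "quoted" string, or a bare word.
_TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Assumed positions in the tokenised line (0-based): the operation
# (e.g. REST.GET.OBJECT) is the 7th field, the user agent the 17th.
OPERATION_FIELD = 6
USER_AGENT_FIELD = 16


def count_requests(log_lines, agent_marker):
    """Count S3 operations issued by processes whose user agent contains agent_marker."""
    counts = Counter()
    for line in log_lines:
        fields = _TOKEN.findall(line)
        if len(fields) <= USER_AGENT_FIELD:
            continue  # truncated or unexpected line: skip rather than guess
        if agent_marker not in fields[USER_AGENT_FIELD]:
            continue  # request came from some other process
        counts[fields[OPERATION_FIELD]] += 1
    return counts
```

Running this over the logs of a "before" run and an "after" run, each with its own user agent marker, gives per-operation request counts that can be compared directly.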


On Tue, 23 Jun 2026 at 23:08, Varun Lakhyani <[email protected]>
wrote:

> I have a PR [1] which doesn't affect the current encryption, metrics, or
> anything else.
> It just fetches the whole file as a byte array and lets Parquet (or any
> other format) read from memory rather than from cloud storage; that is the
> only change here.
>
> Also, I will benchmark with the S3 accelerator enabled and will try to
> understand it further.
> That said, for small files the approaches are complementary - the
> accelerator does predictive prefetching which is valuable for large files,
> but for small files below a threshold a single whole-file fetch eliminates
> all prediction overhead entirely with bounded and predictable memory usage
> (capped at the threshold).
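
To illustrate the threshold idea described above, here is a minimal sketch. It is not the code from the PR: `CountingStore` and `EagerReader` are hypothetical names, the store is an in-memory stand-in that only counts ranged reads, and the file size is assumed to be known up front.

```python
class CountingStore:
    """In-memory stand-in for an object store; counts ranged reads (GETs)."""

    def __init__(self, data: bytes):
        self.data = data
        self.get_count = 0

    def read_range(self, offset: int, length: int) -> bytes:
        self.get_count += 1
        return self.data[offset:offset + length]


class EagerReader:
    """Serves reads from memory when the file is at or below the threshold."""

    def __init__(self, store: CountingStore, size: int, threshold: int):
        self._store = store
        # Small file: one GET for everything, so memory use is capped at threshold.
        # Large file: keep no buffer and pass every read through to the store.
        self._buffer = store.read_range(0, size) if size <= threshold else None

    def read(self, offset: int, length: int) -> bytes:
        if self._buffer is not None:
            return self._buffer[offset:offset + length]
        return self._store.read_range(offset, length)
```

With this shape, a small file costs exactly one GET regardless of how many reads the format library issues, while a file above the threshold behaves as it would without the wrapper.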
>
> The implementation is not tied to Parquet or S3 - EagerInputFile wraps any
> InputFile and should work with any format (I haven't tested this).
> I have only benchmarked Parquet + S3 (60-65% improvement), so I can't be
> sure, but the same benefit should be present for ADLS and GCS.
>
> [1] https://github.com/apache/iceberg/pull/16729
>
> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I would like to discuss an optimization for Iceberg's Parquet read path,
>> specifically around reducing S3 GET requests for small file workloads -
>> Root Manifest, Datafiles, and small file compaction.
>>
>> *Problem*
>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>> FileScanTask, it issues 3 GET requests:
>>
>>    1. Footer size discovery - 1 GET reads the last 8 bytes of the
>>    Parquet file to find the actual footer size (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    3. Row group fetch - 1 GET per row group to fetch actual data
>>    (this.current = currentIterator.next() in BaseReader.next)
>>
>>
>> *Background* - arrow-rs (Parquet Rust implementation)
>> arrow-rs already addresses the first two calls via
>> `with_footer_size_hint`. It fetches a hint-sized range from the end of the
>> file, which includes the actual footer size - if the footer already falls
>> within that fetched range, 1 GET is eliminated. If not, a second GET
>> fetches the footer. DataFusion builds on this today.
>> For our use case, we can go further: since the files are small, instead
>> of a hint we can fetch the whole file at once in a single GET - memory is
>> not a concern for such small files in parquet-mr - collapsing all 3 calls
>> into one.
>> As the number of files grows, footer request time starts dominating over
>> actual data request time - clearly visible in benchmarks below.
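
The size-hint logic described above can be sketched as follows. This is an illustration of the idea only, not the arrow-rs or parquet-mr code: `read_footer` and its `read_range` callback are hypothetical names, and the sketch assumes an unencrypted Parquet file, which ends with the footer, a 4-byte little-endian footer length, and the magic bytes `PAR1`.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer(read_range, file_size, size_hint):
    """Return (footer_bytes, number_of_reads) using a tail read of size_hint bytes.

    read_range(offset, length) is a caller-supplied ranged read (one GET each).
    """
    if file_size < TAIL_LEN:
        raise ValueError("file too small to be a Parquet file")
    fetch_len = min(file_size, max(size_hint, TAIL_LEN))
    tail = read_range(file_size - fetch_len, fetch_len)
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + TAIL_LEN > file_size:
        raise ValueError("footer length exceeds file size")
    if footer_len + TAIL_LEN <= fetch_len:
        # The hint covered the whole footer: no second read needed.
        start = fetch_len - TAIL_LEN - footer_len
        return tail[start:fetch_len - TAIL_LEN], 1
    # Hint was too small: fetch the footer with a second ranged read.
    footer = read_range(file_size - TAIL_LEN - footer_len, footer_len)
    return footer, 2
```

Passing `size_hint=file_size` is the degenerate case discussed in the thread: the single read returns the whole file, footer included.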
>>
>> *Two Approaches*
>>
>>    1. Implement directly in Iceberg - I have a high-level PR for this
>>    implementation - complete workaround in Iceberg codebase. (
>>    https://github.com/apache/iceberg/pull/16729)
>>    2. Fix upstream in parquet-mr - The architecturally correct path: add
>>    this functionality to parquet-mr itself and use it entirely, mirroring
>>    what the Rust implementation does natively.
>>
>>
>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>> iterations)
>> Combining S3 GET requests alone gives 60-65% improvement, with further
>> gains possible by parallelising them.
>>
>> [image: image.png]
>>
>>
>> As focus shifts towards Root Manifest, Datafiles in Parquet, and workloads
>> with many small files, a dedicated effort here seems worth pursuing.
>> I would be happy to hear any thoughts on this. The points to discuss are
>> which approach is preferable - the Iceberg implementation or the upstream
>> parquet-mr implementation - and the gaps between parquet-mr and arrow-rs,
>> specifically around fetching the footer.
>>
>> [1] PR for high level implementation -
>> https://github.com/apache/iceberg/pull/16729
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
>> Contact: +91 96246 46174
>>
>>
