I would be happy to hear further thoughts on this.
Thanks

On Wed, Jun 24, 2026 at 3:37 AM Varun Lakhyani <[email protected]>
wrote:

> I have a PR [1] that does not affect the current encryption, metrics, or
> any other behavior.
> It only fetches the whole file as a byte array and lets Parquet (or any
> other format) read from memory instead of from cloud storage; that should
> be the only change here.
>
> Also, I will benchmark with the S3 accelerator enabled and will try to
> understand it further.
> That said, for small files the approaches are complementary - the
> accelerator does predictive prefetching which is valuable for large files,
> but for small files below a threshold a single whole-file fetch eliminates
> all prediction overhead entirely with bounded and predictable memory usage
> (capped at the threshold).
>
> The implementation is not tied to Parquet or S3 - EagerInputFile wraps any
> InputFile and should work with any format (I have not tested other formats
> yet). I have only benchmarked Parquet + S3 (60-65% improvement), so I am
> not certain, but the same benefit should apply to ADLS and GCS.
>
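To make the wrapping idea concrete, here is a minimal Java sketch of a threshold-gated eager fetch. It is not the code from the PR: RangeSource, CountingRemote, InMemory and eagerIfSmall are hypothetical names, and the "remote" is just a byte array that counts the calls that would be GETs against an object store. It also assumes the file length is already known (for example from table metadata), so asking for it costs no request.

```java
import java.util.Arrays;

// Hypothetical illustration only: not Iceberg's InputFile API and not the PR's code.
class EagerFetchSketch {
  /** Minimal range-read abstraction assumed for this sketch. */
  interface RangeSource {
    long length();
    byte[] read(long start, int len);
  }

  /** Stand-in for an object store; each read() models one ranged GET. */
  static final class CountingRemote implements RangeSource {
    private final byte[] object;
    int gets = 0;
    CountingRemote(byte[] object) { this.object = object; }
    public long length() { return object.length; }
    public byte[] read(long start, int len) {
      gets++;
      return Arrays.copyOfRange(object, (int) start, (int) start + len);
    }
  }

  /** Serves every read from a byte array that was fetched once. */
  static final class InMemory implements RangeSource {
    private final byte[] bytes;
    InMemory(byte[] bytes) { this.bytes = bytes; }
    public long length() { return bytes.length; }
    public byte[] read(long start, int len) {
      return Arrays.copyOfRange(bytes, (int) start, (int) start + len);
    }
  }

  /**
   * Files at or below the threshold are fetched whole with a single read and
   * served from memory; larger files are returned untouched, so the extra
   * memory held per file is capped at thresholdBytes.
   */
  static RangeSource eagerIfSmall(RangeSource remote, long thresholdBytes) {
    long len = remote.length();
    if (len > thresholdBytes || len > Integer.MAX_VALUE) {
      return remote;
    }
    return new InMemory(remote.read(0, (int) len));
  }

  public static void main(String[] args) {
    CountingRemote remote = new CountingRemote(new byte[1000]);
    RangeSource source = eagerIfSmall(remote, 4096);
    source.read(992, 8);   // footer length
    source.read(800, 192); // footer
    source.read(0, 800);   // row group
    System.out.println("GETs issued: " + remote.gets);
  }
}
```

Because the wrapper only sees byte ranges, nothing in it depends on the file format or the storage backend, which is the property claimed above for the real implementation.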
> [1] https://github.com/apache/iceberg/pull/16729
>
> On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I would like to discuss an optimization for Iceberg's Parquet read path,
>> specifically around reducing S3 GET requests for small file workloads -
>> Root Manifest, Datafiles, and small file compaction.
>>
>> *Problem*
>> The current Iceberg flow for Spark readers uses parquet-mr. For each
>> FileScanTask, it issues at least 3 GET requests (exactly 3 for a file with
>> a single row group):
>>
>>    1. Footer size discovery - 1 GET reads the last 8 bytes of the
>>    Parquet file to find the actual footer size (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    2. Footer fetch - 1 GET reads the footer (this.currentIterator =
>>    open(currentTask) in BaseReader.next)
>>    3. Row group fetch - 1 GET per row group to fetch actual data
>>    (this.current = currentIterator.next() in BaseReader.next)
>>
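The three reads above can be sketched in Java against a fake file. This is an illustration, not parquet-mr or Iceberg code: FooterReadSketch, CountingStore and fakeFile are hypothetical names, the "store" is an in-memory byte array that counts ranged reads, and the footer and row group are opaque filler bytes. The only real detail relied on is the Parquet trailer layout: the footer is followed by its 4-byte little-endian length and the 4-byte magic "PAR1".

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only: not parquet-mr or Iceberg code.
class FooterReadSketch {
  static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
  static final int TAIL_LEN = 8; // 4-byte little-endian footer length + 4-byte magic

  /** Stand-in for an object store; each get() models one ranged GET. */
  static final class CountingStore {
    private final byte[] object;
    int gets = 0;
    CountingStore(byte[] object) { this.object = object; }
    long length() { return object.length; }
    byte[] get(long start, int len) {
      gets++;
      return Arrays.copyOfRange(object, (int) start, (int) start + len);
    }
  }

  /** Builds a fake file: magic, filler "row group", filler footer, footer length, magic. */
  static byte[] fakeFile(int rowGroupBytes, int footerBytes) {
    ByteBuffer buf = ByteBuffer.allocate(4 + rowGroupBytes + footerBytes + TAIL_LEN)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put(MAGIC);
    buf.put(new byte[rowGroupBytes]);
    byte[] footer = new byte[footerBytes];
    Arrays.fill(footer, (byte) 0x7F);
    buf.put(footer);
    buf.putInt(footerBytes);
    buf.put(MAGIC);
    return buf.array();
  }

  /** GET 1: read the last 8 bytes to learn the footer length. */
  static int readFooterLength(CountingStore store) {
    byte[] tail = store.get(store.length() - TAIL_LEN, TAIL_LEN);
    if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
      throw new IllegalStateException("missing trailing magic");
    }
    return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  /** GET 2: read the footer itself, which sits just before the 8-byte tail. */
  static byte[] readFooter(CountingStore store, int footerLen) {
    return store.get(store.length() - TAIL_LEN - footerLen, footerLen);
  }

  /** GET 3: read the single row group between the leading magic and the footer. */
  static byte[] readRowGroup(CountingStore store, int footerLen) {
    int len = (int) (store.length() - TAIL_LEN - footerLen - MAGIC.length);
    return store.get(MAGIC.length, len);
  }

  public static void main(String[] args) {
    CountingStore store = new CountingStore(fakeFile(1000, 120));
    int footerLen = readFooterLength(store);
    readFooter(store, footerLen);
    readRowGroup(store, footerLen);
    System.out.println("GETs issued: " + store.gets);
  }
}
```

The second read cannot be issued until the first has returned, because its offset depends on the footer length. For a small file that dependency is two sequential round trips before any data is read.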
>>
>> *Background* - arrow-rs (Parquet Rust implementation)
>> arrow-rs already addresses the first two calls via
>> `with_footer_size_hint`. It fetches a hint-sized range from the end of the
>> file, which includes the actual footer size - if the footer already falls
>> within that fetched range, 1 GET is eliminated. If not, a second GET
>> fetches the footer. DataFusion builds on this today.
>> For our use case, we can go further: since the files are small, memory is
>> not a concern in parquet-mr, so instead of a hint we can fetch the whole
>> file in a single GET, collapsing all 3 calls into one.
>> As the number of files grows, footer request time starts dominating over
>> actual data request time - clearly visible in benchmarks below.
>>
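The size-hint behaviour described above can be sketched as follows. This is a loose Java rendering of the idea, not arrow-rs code and not a parquet-mr API: FooterHintSketch, CountingStore and fakeFile are hypothetical names, and the store is an in-memory byte array that counts ranged reads.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only: a Java rendering of the hint idea, not arrow-rs code.
class FooterHintSketch {
  static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
  static final int TAIL_LEN = 8; // 4-byte little-endian footer length + 4-byte magic

  /** Stand-in for an object store; each get() models one ranged GET. */
  static final class CountingStore {
    private final byte[] object;
    int gets = 0;
    CountingStore(byte[] object) { this.object = object; }
    long length() { return object.length; }
    byte[] get(long start, int len) {
      gets++;
      return Arrays.copyOfRange(object, (int) start, (int) start + len);
    }
  }

  /** Builds a fake file: magic, filler data, filler footer, footer length, magic. */
  static byte[] fakeFile(int dataBytes, int footerBytes) {
    ByteBuffer buf = ByteBuffer.allocate(4 + dataBytes + footerBytes + TAIL_LEN)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put(MAGIC);
    buf.put(new byte[dataBytes]);
    byte[] footer = new byte[footerBytes];
    Arrays.fill(footer, (byte) 0x7F);
    buf.put(footer);
    buf.putInt(footerBytes);
    buf.put(MAGIC);
    return buf.array();
  }

  /**
   * Reads the last max(hint, 8) bytes in one request. If the footer lies
   * inside that suffix it is sliced out locally; otherwise a second request
   * fetches exactly the footer.
   */
  static byte[] readFooter(CountingStore store, int hint) {
    long fileLen = store.length();
    int suffixLen = (int) Math.min(fileLen, Math.max(hint, TAIL_LEN));
    byte[] suffix = store.get(fileLen - suffixLen, suffixLen);
    int footerLen = ByteBuffer.wrap(suffix, suffixLen - TAIL_LEN, 4)
        .order(ByteOrder.LITTLE_ENDIAN)
        .getInt();
    if (footerLen + TAIL_LEN <= suffixLen) {
      int from = suffixLen - TAIL_LEN - footerLen;
      return Arrays.copyOfRange(suffix, from, from + footerLen);
    }
    return store.get(fileLen - TAIL_LEN - footerLen, footerLen);
  }

  public static void main(String[] args) {
    byte[] file = fakeFile(1000, 120);
    CountingStore generous = new CountingStore(file);
    readFooter(generous, 64 * 1024);
    CountingStore tight = new CountingStore(file);
    readFooter(tight, 64);
    System.out.println("GETs with large hint: " + generous.gets);
    System.out.println("GETs with small hint: " + tight.gets);
  }
}
```

When the hint is at least as large as the file, the suffix read already covers the whole file, which is the limiting case the whole-file fetch proposed here relies on.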
>> *Two Approaches*
>>
>>    1. Implement directly in Iceberg - I have a high-level PR that
>>    implements this as a complete workaround within the Iceberg codebase
>>    (https://github.com/apache/iceberg/pull/16729).
>>    2. Fix upstream in parquet-mr - The architecturally correct path: add
>>    this functionality to parquet-mr itself and use it entirely, mirroring
>>    what the Rust implementation does natively.
>>
>>
>> *JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
>> iterations)
>> Combining S3 GET requests alone gives 60-65% improvement, with further
>> gains possible by parallelising them.
>>
>> [image: image.png]
>>
>>
>> As focus shifts towards Root Manifest, Datafiles in Parquet, and multiple
>> small file requirements, a dedicated effort here seems worth pursuing.
>> I would be happy to hear any thoughts on this. Points to discuss: which
>> approach seems more convincing (the Iceberg implementation or the upstream
>> parquet-mr implementation), and the gaps between parquet-mr and arrow-rs,
>> specifically around fetching the footer.
>>
>> [1] PR for high level implementation -
>> https://github.com/apache/iceberg/pull/16729
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
>> Contact: +91 96246 46174
>>
>>
