Re: [DISCUSS] Combine 3 GET calls for parquet reads - Root Manifest, Datafiles and compaction of small files

Jones, Danny Thu, 25 Jun 2026 12:05:49 -0700

I have been meaning to chime in on this thread. I’m part of the S3 team and 
caught up with a few folks who have better context than me on the analytics 
accelerator (AAL for S3). (I’m usually having fun with iceberg-rust day-to-day.)


I think it’s great to see optimizations upstream, either to iceberg-java or to 
parquet-mr. One of the driving reasons behind AAL was to be able to deliver a lot 
of meaningful improvements across different analytics libraries (primarily 
S3FileIO and S3A), but ultimately I would second Dan’s point that it will be 
great to see these sorts of optimizations made accessible to all users of 
iceberg-java (and parquet-mr even!). In the meantime, users can opt-in to the 
accelerator for S3-based workloads.
The changes proposed by Varun sound good. There are a few others we had in mind 
– Steve L mentioned integration with vectored IO APIs which would deliver read 
optimizations in the right layer without the IO stream needing to understand 
the data format.

There are two things I’d recommend as further reading (though this is a bit 
beyond the 3 GET optimization that was the original purpose for this thread):


  *   This doc explored the optimizations made in AAL: 
https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/
  *   This e-mail thread proposed making AAL generic, as a central way to 
optimize streams across many Apache projects. There was interesting discussion 
around pushing the optimizations instead into the iceberg or parquet layers. 
https://lists.apache.org/thread/cy6y5xf5gg8fr12pg64f77gxdrtv52fn

Danny

From: Daniel Weeks <[email protected]>
Reply to: "[email protected]" <[email protected]>
Date: Thursday, 25 June 2026 at 18:56
To: "[email protected]" <[email protected]>
Subject: RE: [EXTERNAL] [DISCUSS] Combine 3 GET calls for parquet reads - Root 
Manifest, Datafiles and compaction of small files


I would actually prefer that we don't rely too much on the analytics 
accelerator and rather focus on improving the native implementation.

I'm not opposed to the accelerator but there's a lot of hidden behaviors that 
have other tradeoffs in terms of requests and memory usage that aren't 
necessarily apparent.

Something like this where you have a solution that works across multiple 
implementations is a generally good improvement.

I am interested to see how big the performance difference is though.

-Dan

On Thu, Jun 25, 2026 at 4:08 AM Steve Loughran 
<[email protected]> wrote:
commented on the PR.

you should be benchmarking against the aws accelerator as it is likely to show 
less dramatic speedups, and be more honest in the process.

if you want to do some serious measurement of the cost of s3 head/get requests 
in benchmarks,

  1.  turn on s3 bucket logging to collect logs for requests
  2.  set the user agent on your test processes to be unique
  3.  grab the logs and count the requests after

there is a tool to take the aws logs and convert them to avro records, after 
which you can pull them into spark: 
https://github.com/apache/hadoop-cloudstore/blob/main/src/site/markdown/auditlogs.md

doing that as a before/after of any change assesses the real savings of the 
work, independent of execution time.
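
As a rough illustration of step 3, here is a minimal Python sketch that counts requests per operation for one user agent. It assumes the standard space-delimited S3 server access log layout, with the operation as the 7th field and the user agent as the 17th. The sample lines and the `bench-run-before` user agent are made up for the example, and the sketch does not handle escaped quotes inside quoted fields.

```python
import re
from collections import Counter

# A token is a [bracketed timestamp], a "quoted string", or a run of non-space characters.
_TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Zero-based field positions in the S3 server access log format.
OPERATION_FIELD = 6
USER_AGENT_FIELD = 16


def count_operations(log_lines, user_agent_marker):
    """Count requests per operation for lines whose user agent contains the marker."""
    counts = Counter()
    for line in log_lines:
        fields = _TOKEN.findall(line)
        if len(fields) <= USER_AGENT_FIELD:
            continue  # skip malformed or truncated lines
        if user_agent_marker in fields[USER_AGENT_FIELD]:
            counts[fields[OPERATION_FIELD]] += 1
    return counts


def _line(op, key, agent):
    # Made-up entry shaped like an S3 server access log record.
    return (f'owner bucket [25/Jun/2026:10:00:00 +0000] 10.0.0.1 requester REQ1 '
            f'{op} {key} "GET /{key} HTTP/1.1" 200 - 8 1024 5 4 "-" "{agent}" -')


sample = [
    _line("REST.GET.OBJECT", "data/f1.parquet", "bench-run-before"),
    _line("REST.GET.OBJECT", "data/f1.parquet", "bench-run-before"),
    _line("REST.HEAD.OBJECT", "data/f1.parquet", "bench-run-before"),
    _line("REST.GET.OBJECT", "data/f2.parquet", "other-client"),
]
counts = count_operations(sample, "bench-run-before")
```

Running this once with the "before" user agent and once with the "after" one gives the per-operation request counts to compare.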


On Tue, 23 Jun 2026 at 23:08, Varun Lakhyani 
<[email protected]> wrote:
I have a PR [1] which doesn't affect current encryption, metrics, or anything 
else.
It just fetches the whole file as a byte array and lets Parquet (or any other 
format) read from memory rather than from cloud storage; that is the only 
change here.

Also, I will benchmark with the S3 accelerator enabled and will try to 
understand it further.
That said, for small files the approaches are complementary - the accelerator 
does predictive prefetching which is valuable for large files,
but for small files below a threshold a single whole-file fetch eliminates all 
prediction overhead entirely with bounded and predictable memory usage (capped 
at the threshold).

The implementation is not tied to Parquet or S3 - EagerInputFile wraps any 
InputFile and works with any format (haven't tested but should work fine).
I benchmarked only Parquet + S3 (60-65% improvement), so I can't be certain, 
but the same benefit should be present for ADLS and GCS.
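
To make the wrapping idea concrete, here is a minimal Python sketch of a threshold-based eager wrapper. It is not the code from the PR: `RemoteFile` and `EagerFile` are hypothetical stand-ins, each `read` on the remote file is assumed to cost one GET, and the file length is assumed to be known without a request.

```python
class RemoteFile:
    """Hypothetical stand-in for a remote file; every read counts as one GET."""

    def __init__(self, data):
        self._data = data
        self.gets = 0

    def length(self):
        # Assumed to come from table metadata, so no request is counted.
        return len(self._data)

    def read(self, start, end):
        self.gets += 1
        return self._data[start:end]


class EagerFile:
    """Serves reads from memory for files at or below the threshold."""

    def __init__(self, inner, threshold):
        self._inner = inner
        self._threshold = threshold
        self._buffer = None

    def length(self):
        return self._inner.length()

    def read(self, start, end):
        if self._inner.length() > self._threshold:
            # Large file: pass the ranged read straight through.
            return self._inner.read(start, end)
        if self._buffer is None:
            # Small file: one whole-file fetch, then slice from memory.
            self._buffer = self._inner.read(0, self._inner.length())
        return self._buffer[start:end]


small = RemoteFile(b"x" * 100)
eager = EagerFile(small, threshold=1024)
for start, end in [(92, 100), (50, 92), (0, 50)]:
    eager.read(start, end)
small_gets = small.gets  # three reads served by a single whole-file GET
```

Memory use stays bounded by the threshold because anything larger is never buffered.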

[1] https://github.com/apache/iceberg/pull/16729

On Tue, Jun 9, 2026 at 2:30 AM Varun Lakhyani 
<[email protected]> wrote:
Hello everyone,

I would like to discuss an optimization for Iceberg's Parquet read path, 
specifically around reducing S3 GET requests for small file workloads - Root 
Manifest, Datafiles, and small file compaction.

Problem
The current Iceberg flow for Spark readers uses parquet-mr. For each 
FileScanTask, it issues 3 GET requests:
1. Footer size discovery - 1 GET reads the last 8 bytes of the Parquet file to 
   find the actual footer size (this.currentIterator = open(currentTask) in 
   BaseReader.next)
2. Footer fetch - 1 GET reads the footer (this.currentIterator = 
   open(currentTask) in BaseReader.next)
3. Row group fetch - 1 GET per row group to fetch actual data 
   (this.current = currentIterator.next() in BaseReader.next)
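
The three requests can be sketched against an in-memory stand-in for the object store. This is a minimal Python illustration, not parquet-mr code: `CountingStore` and `make_fake_file` are hypothetical helpers, the footer bytes are an opaque placeholder rather than real Thrift metadata, and the file has a single row group whose range is derived from the layout instead of being read from the footer.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """In-memory stand-in for an object store that counts ranged GETs."""

    def __init__(self, data):
        self.data = data
        self.gets = 0

    def get_range(self, start, end):
        self.gets += 1
        return self.data[start:end]


def make_fake_file(row_group, footer):
    # Parquet layout: magic, data, footer, 4-byte little-endian footer length, magic.
    return MAGIC + row_group + footer + struct.pack("<I", len(footer)) + MAGIC


def read_three_gets(store, size):
    # GET 1: the last 8 bytes hold the footer length and the trailing magic.
    tail = store.get_range(size - 8, size)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    footer_start = size - 8 - footer_len
    # GET 2: the footer itself.
    footer = store.get_range(footer_start, size - 8)
    # GET 3: the row group (a real reader takes its offsets from the footer).
    row_group = store.get_range(len(MAGIC), footer_start)
    return footer, row_group


data = make_fake_file(b"rowgroup-bytes", b"footer-bytes")
store = CountingStore(data)
footer, row_group = read_three_gets(store, len(data))
```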

Background - arrow-rs (Rust Parquet implementation)
arrow-rs already addresses the first two calls via `with_footer_size_hint`. It 
fetches a hinted number of bytes from the end of the file, which includes the 
actual footer size - if the footer already falls within that fetched range, 
1 GET is eliminated. If not, a second GET fetches the footer. DataFusion builds 
on this today.
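
The hint logic can be sketched as follows. This is a minimal Python illustration of the idea, not arrow-rs's actual code: `CountingStore` and `read_footer_with_hint` are hypothetical names, and the footer bytes are an opaque placeholder.

```python
import struct

MAGIC = b"PAR1"


class CountingStore:
    """In-memory stand-in for an object store that counts ranged GETs."""

    def __init__(self, data):
        self.data = data
        self.gets = 0

    def get_range(self, start, end):
        self.gets += 1
        return self.data[start:end]


def read_footer_with_hint(store, size, hint):
    # Always fetch at least the 8-byte trailer, and never more than the file.
    hint = max(8, min(hint, size))
    tail = store.get_range(size - hint, size)
    footer_len = struct.unpack("<I", tail[-8:-4])[0]
    if footer_len + 8 <= len(tail):
        # The footer is already inside the fetched suffix: one GET total.
        return tail[len(tail) - 8 - footer_len:len(tail) - 8]
    # Hint was too small: a second GET fetches the footer.
    footer_start = size - 8 - footer_len
    return store.get_range(footer_start, size - 8)


footer_bytes = b"footer-bytes"
data = MAGIC + b"d" * 100 + footer_bytes + struct.pack("<I", len(footer_bytes)) + MAGIC

good_hint = CountingStore(data)
footer_a = read_footer_with_hint(good_hint, len(data), hint=64)

small_hint = CountingStore(data)
footer_b = read_footer_with_hint(small_hint, len(data), hint=8)
```

With a hint that covers the footer the read costs one GET, and with a hint that is too small it falls back to two, which is no worse than the unhinted path.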
For our use case, we can go further: since the files are small, instead of a 
hint we can fetch the whole file at once in a single GET - no memory concern in 
parquet-mr - collapsing all 3 calls into one.
As the number of files grows, footer request time starts dominating over actual 
data request time - clearly visible in benchmarks below.

Two Approaches
1. Implement directly in Iceberg - I have a high-level PR for this 
   implementation - complete workaround in Iceberg codebase. 
   (https://github.com/apache/iceberg/pull/16729)
2. Fix upstream in parquet-mr - The architecturally correct path: add this 
   functionality to parquet-mr itself and use it entirely, mirroring what the 
   Rust implementation does natively.

JMH Benchmark Results (20M total rows, S3, 2 warmup + 5 measurement iterations)
Combining S3 GET requests alone gives 60-65% improvement, with further gains 
possible by parallelising them.

[Benchmark results chart: image not preserved in the plain-text archive]


As focus shifts towards Root Manifest, Datafiles in Parquet, and multiple small 
file requirements, a dedicated effort here seems worth pursuing.
I would be happy to hear any thoughts on this. The main points to discuss are 
which approach seems more convincing (the Iceberg implementation or the 
upstream parquet-mr implementation) and any further thoughts on the gaps 
between parquet-mr and arrow-rs, specifically around fetching the footer.

[1] PR for high level implementation - 
https://github.com/apache/iceberg/pull/16729
--
Lakhyani Varun
Indian Institute of Technology Roorkee
Contact: +91 96246 46174
