Hello everyone,

I would like to discuss an optimization for Iceberg's Parquet read path:
reducing the number of S3 GET requests for small-file workloads such as
Root Manifest, data files, and small-file compaction.

*Problem*
The current Iceberg read path for Spark uses parquet-mr. For each
FileScanTask, it issues 2 GET requests for the footer plus 1 per row group,
so 3 GET requests for a small file with a single row group:

   1. Footer size discovery - 1 GET reads the last 8 bytes of the Parquet
   file to find the actual footer size (this.currentIterator =
   open(currentTask) in BaseReader.next)
   2. Footer fetch - 1 GET reads the footer (this.currentIterator =
   open(currentTask) in BaseReader.next)
   3. Row group fetch - 1 GET per row group to fetch actual data
   (this.current = currentIterator.next() in BaseReader.next)
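
To make step 1 concrete: a Parquet file ends with a 4-byte little-endian
footer length followed by the 4-byte magic "PAR1", which is why 8 bytes must
be read before the footer's position is known. Below is a minimal sketch of
that decoding. The class and method names (ParquetTail, footerLength,
footerStart) are hypothetical and only for illustration; this is not
parquet-mr or Iceberg code, and it assumes an unencrypted footer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical name), not part of parquet-mr or Iceberg. */
final class ParquetTail {
    // 4-byte little-endian footer length + 4-byte magic.
    static final int TAIL_LEN = 8;
    // Magic for files with a plaintext footer (assumption: no footer encryption).
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Decodes the footer length from a buffer that ends with the file's last 8 bytes. */
    static int footerLength(byte[] tail) {
        if (tail.length < TAIL_LEN) {
            throw new IllegalArgumentException("need at least " + TAIL_LEN + " bytes");
        }
        int off = tail.length - TAIL_LEN;
        for (int i = 0; i < MAGIC.length; i++) {
            if (tail[off + 4 + i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing PAR1 magic");
            }
        }
        int len = ByteBuffer.wrap(tail, off, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (len < 0) {
            throw new IllegalArgumentException("negative footer length");
        }
        return len;
    }

    /** Byte offset at which the footer starts, given the total file length. */
    static long footerStart(long fileLength, int footerLength) {
        return fileLength - TAIL_LEN - footerLength;
    }
}
```

Only after footerLength returns can the reader compute footerStart and issue
the second GET, which is the dependency that forces two sequential round
trips in steps 1 and 2.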


*Background* - arrow-rs (Rust Parquet implementation)
arrow-rs already addresses the first two calls via `with_footer_size_hint`.
It speculatively fetches a hint-sized range from the end of the file, which
includes the trailing bytes that hold the actual footer size. If the footer
already falls within that fetched range, 1 GET is eliminated; if not, a
second GET fetches the footer. DataFusion builds on this today.
For our use case, we can go further: since the files are small, instead of
a hint we can fetch the whole file in a single GET, collapsing all 3 calls
into 1. Holding such a small file in memory should not be a concern for
parquet-mr.
As the number of files grows, footer request time starts to dominate
actual data request time - clearly visible in the benchmarks below.
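
The hint-based read described above can be sketched as follows. This is a
simplified illustration of the idea, not the arrow-rs, parquet-mr, or PR
implementation: CountingStore is a hypothetical in-memory stand-in for S3
that counts ranged GETs, and readFooter / hintBytes are names made up for
this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Sketch of a footer read with a size hint (hypothetical names throughout). */
final class FooterPrefetch {

    /** Hypothetical stand-in for an object store: serves byte ranges and counts GETs. */
    static final class CountingStore {
        final byte[] data;
        int gets = 0;

        CountingStore(byte[] data) {
            this.data = data;
        }

        /** Returns bytes in [start, end) and records one GET. */
        byte[] getRange(int start, int end) {
            gets++;
            return Arrays.copyOfRange(data, start, end);
        }
    }

    /**
     * Reads the raw footer bytes. One GET if the footer fits inside the
     * speculatively fetched tail, otherwise two.
     */
    static byte[] readFooter(CountingStore store, int hintBytes) {
        int fileLen = store.data.length;
        // Always fetch at least the 8-byte trailer, never more than the file.
        int tailLen = Math.min(fileLen, Math.max(hintBytes, 8));
        byte[] tail = store.getRange(fileLen - tailLen, fileLen);

        // Footer length is the little-endian int just before the 4-byte magic.
        int footerLen = ByteBuffer.wrap(tail, tail.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        int needed = footerLen + 8;
        if (footerLen < 0 || needed > fileLen) {
            throw new IllegalStateException("corrupt footer length: " + footerLen);
        }

        if (needed <= tail.length) {
            // Hint was large enough: slice the footer out of the bytes we already have.
            return Arrays.copyOfRange(tail, tail.length - needed, tail.length - 8);
        }
        // Hint was too small: a second GET fetches exactly the footer.
        return store.getRange(fileLen - needed, fileLen - 8);
    }
}
```

Passing a hint at least as large as the file (for example Integer.MAX_VALUE)
degenerates into the whole-file fetch proposed here: the single GET returns
the footer and the row group bytes together, so no further request is needed
for a small file.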

*Two Approaches*

   1. Implement directly in Iceberg - I have a high-level PR for this
   implementation, a complete workaround within the Iceberg codebase. (
   https://github.com/apache/iceberg/pull/16729)
   2. Fix upstream in parquet-mr - the architecturally correct path: add
   this functionality to parquet-mr itself and rely on it from Iceberg,
   mirroring what the Rust implementation does natively.


*JMH Benchmark Results* (20M total rows, S3, 2 warmup + 5 measurement
iterations)
Combining the S3 GET requests alone gives a 60-65% improvement, with further
gains possible by parallelising them.

[image: image.png]


As focus shifts towards Root Manifest, data files in Parquet, and use cases
involving many small files, a dedicated effort here seems worth pursuing.
I would be happy to hear any thoughts on this. The main points to discuss
are which approach seems more convincing (the Iceberg implementation or the
upstream parquet-mr implementation) and what other gaps exist between
parquet-mr and arrow-rs, specifically around fetching the footer.

[1] PR for high level implementation -
https://github.com/apache/iceberg/pull/16729
-- 
Lakhyani Varun
Indian Institute of Technology Roorkee
Contact: +91 96246 46174
