Hey All,

I previously started a discussion about making Spark readers read files in
parallel (asynchronously), which is beneficial for workloads with large
numbers of small files, such as compaction. Since then, I have built a POC,
a high-level design, an implementation, and benchmarks for various
scenarios. I presented my approach and benchmarking results in the Iceberg
Spark sync; the recording may be available via the Iceberg Spark Community
Sync Notes [0].
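For anyone who missed the sync, the core idea can be sketched roughly as
follows. This is an illustrative toy only (the class and method names are
mine, not the actual code in the PR): IO-bound reads of many small files are
submitted to a thread pool so they overlap, instead of each read blocking
until the previous one finishes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy sketch of asynchronous small-file reading: submit each read to a
// pool, then collect results. With N files and parallelism P, total wall
// time approaches ceil(N / P) * perFileLatency instead of N * perFileLatency.
public class AsyncReadSketch {

    // Stand-in for an IO-bound read of one small file (e.g. an object-store GET).
    static String readFile(String path) {
        try {
            Thread.sleep(10); // simulated IO latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "contents of " + path;
    }

    // Read all paths concurrently; results are returned in submission order.
    public static List<String> readAll(List<String> paths, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String p : paths) {
                futures.add(pool.submit(() -> readFile(p)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks only until this file's read completes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAll(List.of("a.parquet", "b.parquet"), 4));
    }
}
```

The actual design (prefetch depth, ordering guarantees, memory bounds, and
how this plugs into Spark's reader) is covered in the design document linked
below.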

I am planning to submit a GSoC 2026 proposal based on this work and was
advised to seek formal community vetting on the dev mailing list.

Previous DISCUSS thread:
https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn

Issue:
https://github.com/apache/iceberg/issues/15287

Prototype implementation:
https://github.com/apache/iceberg/pull/15341

Design document and benchmarking details:
https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing

Initial benchmarking shows noticeable improvements for workloads involving
many small files, particularly under high IO latency (details in the design
document).

Any feedback (+1 / concerns / suggestions) would be appreciated. I am
specifically looking for community consensus on whether this is a viable
direction for Iceberg before formalizing the GSoC proposal. Since the GSoC
2026 proposal deadline is March 31, early feedback would be especially
helpful.

[0] Iceberg Spark Community Sync Notes:
https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?usp=sharing
-- 
Lakhyani Varun
Indian Institute of Technology Roorkee
