I have added a separate test for this and benchmarked it against the existing synchronous Spark readers for compaction (rewrite_data_files).
Benchmarking details: I added a small latency overhead using LockSupport.parkNanos(1_000_000) in the open() function of org/apache/iceberg/spark/source/BatchDataReader.java to simulate the real IO overhead caused by cloud storage. (I used @Warmup(iterations = 5) and @Measurement(iterations = 15) for benchmarking.)

Results for compaction (rewrite_data_files) of 1000 files, 15-20 KB each, across various cases:

  Overhead (ms)        Async (s)   Sync (s, existing)   % Improvement
  No manual overhead   0.765       0.932                17.9%
  1                    0.772       2.881                73.2%
  5                    1.778       8.512                79.1%
  10                   3.284       15.159               78.3%
  15                   4.709       21.260               77.8%

Detailed results for 100, 500, and 1000 files across all overheads are in the reference design document:
https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing

The high-level design/POC is complete from my end, and I would welcome any feedback, suggestions, or review from the community to take this further.

On Thu, Feb 12, 2026 at 12:44 PM Varun Lakhyani <[email protected]> wrote:

> Hello All,
>
> I'd like to start a discussion around adding asynchronous capability to
> Spark readers by making them capable of running parallel tasks, especially
> when large numbers of small files are involved.
> Currently, readers are based on BaseReader.next(), where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4-10 KB files), this sequential behavior can introduce significant
> overhead. Each file is opened independently, and the reader waits for one
> task to be fully consumed before opening the next. CPU idleness can also be
> a major issue here.
>
> One possible improvement is to optionally allow Spark readers to function
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
> - Open multiple small-file scan tasks concurrently, read from them
>   asynchronously or in parallel, and stitch their output into a single
>   buffered iterator or stream for downstream processing.
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction and
> cleanup jobs.
>
> *My Question*
>
> - Are there known constraints in Spark's task execution model that
>   would make this approach problematic?
> - Would it be suitable for me to draft a proposal around this idea and
>   work on it?
>
> I've opened a related issue [1] to capture the problem statement and
> initial thoughts.
> Any feedback, pointers to prior discussions, or guidance would be very
> helpful.
>
> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
>
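P.S. For anyone skimming, here is a rough, self-contained sketch of the idea from the quoted message: open several small-file scan tasks concurrently and stitch their output into one result in completion order. The ScanTask interface and readAll helper here are simplified stand-ins for illustration, not Iceberg's actual reader API; the parkNanos call mimics the per-file IO latency injected in the benchmark.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.LockSupport;

public class AsyncStitchReader {
    // Hypothetical stand-in for a small-file scan task: produces its rows when invoked.
    interface ScanTask { List<String> readRows(); }

    // Submit all tasks to a fixed pool; drain results as tasks complete, so slow
    // files do not block the opening/reading of the others.
    static List<String> readAll(List<ScanTask> tasks, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            CompletionService<List<String>> cs = new ExecutorCompletionService<>(pool);
            for (ScanTask t : tasks) {
                cs.submit(t::readRows);
            }
            List<String> out = new java.util.ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                out.addAll(cs.take().get()); // blocks only until the NEXT finished task
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate per-file IO latency the same way the benchmark did.
        List<ScanTask> tasks = new java.util.ArrayList<>();
        for (int f = 0; f < 8; f++) {
            final int id = f;
            tasks.add(() -> {
                LockSupport.parkNanos(1_000_000); // ~1 ms of simulated cloud-storage IO
                return List.of("file-" + id + "-row-0", "file-" + id + "-row-1");
            });
        }
        List<String> rows = readAll(tasks, 4);
        System.out.println(rows.size()); // 2 rows from each of 8 files
    }
}
```

With 4 threads the simulated IO for the 8 files overlaps instead of being paid sequentially, which is where the improvement in the table above comes from. A production version would expose a buffered iterator rather than materializing all rows, and would cap concurrency and buffer size to bound memory.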
