Yes, the lazy calls are a real issue here. I call open() on the background threads, which returns an iterator for that task and ultimately calls InputFile.newStream() down the hierarchy, so the same lazy GET limitation applies.
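To make that concrete, here is a minimal sketch (not the code from the PR) of the kind of background-thread open I have in mind. The SmallFilePrefetcher class, the fixed pool size, and the choice to drain each small file fully into memory are all illustrative assumptions; the point is that the read, not newStream(), is what forces the GET when the connector is lazy:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

import org.apache.iceberg.io.InputFile;

// Illustrative sketch only: for small (KB-sized) files, drain each one into
// memory on a background thread. newStream() alone may not start any IO on a
// lazy connector; the read below is what actually triggers the GET.
class SmallFilePrefetcher {
  private final ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary here

  List<CompletableFuture<byte[]>> prefetch(List<InputFile> files) {
    return files.stream()
        .map(file -> CompletableFuture.supplyAsync(() -> {
          try (InputStream in = file.newStream()) {
            return in.readAllBytes(); // the read, not newStream(), issues the GET
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }, pool))
        .collect(Collectors.toList());
  }
}
```

The consumer could then wrap each completed byte[] in an in-memory stream and hand it to the existing record iterator, which would keep the downstream reader code unchanged.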
I've put together a design doc and a WIP implementation if you have a moment to take a look - would appreciate any thoughts.
Design doc: https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
Implementation (GitHub PR): https://github.com/apache/iceberg/pull/15341

On Thu, Feb 12, 2026 at 8:34 PM Steve Loughran <[email protected]> wrote:

> for an object store, overlapping the GET of the next file with the
> processing of the first would maximise CPU use; there'd be no conflicting
> demand for the core, just an HTTP request issued and awaiting a response
> on one thread while the main CPU carries on its work.
>
> calling InputFile.newStream() async would be enough to start, though any
> cloud connector doing lazy GET calls would be postponing any/all IO until
> the first reads take place...
>
> On Thu, 12 Feb 2026 at 07:14, Varun Lakhyani <[email protected]>
> wrote:
>
>> Hello All,
>>
>> I'd like to start a discussion around adding asynchronous capability to
>> Spark readers by making them capable of running parallel tasks,
>> especially when large numbers of small files are involved.
>> Currently, readers are based on BaseReader.next(), where each task is
>> opened, fully consumed, and closed before moving on to the next one.
>>
>> With workloads containing hundreds or thousands of small files (for
>> example, 4–10 KB files), this sequential behavior can introduce
>> significant overhead. Each file is opened independently, and the reader
>> waits for one task to be fully consumed before opening the next. CPU
>> idleness can also be a major issue here.
>>
>> One possible improvement is to optionally allow Spark readers to operate
>> asynchronously for scans dominated by many small files.
>> At a high level, the idea would be to:
>>
>> - Open multiple small-file scan tasks concurrently, read from them
>> asynchronously or in parallel, and stitch their output into a single
>> buffered iterator or stream for downstream processing
>>
>> The existing sequential behavior would remain the default, with this
>> mode being opt-in or conditionally enabled for small-file-heavy
>> workloads. This could benefit several Iceberg use cases, including
>> compaction or cleanup jobs.
>>
>> *My Questions*
>>
>> - Are there known constraints in Spark's task execution model that
>> would make this approach problematic?
>> - Would it be reasonable for me to draft a proposal for this idea and
>> work on it?
>>
>> I've opened a related issue [1] to capture the problem statement and
>> initial thoughts. Any feedback, pointers to prior discussions, or other
>> guidance would be very helpful.
>>
>> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
>> --
>> Lakhyani Varun
>> Indian Institute of Technology Roorkee
>> Contact: +91 96246 46174
>>
