I have made an initial implementation based on one such approach.
GitHub PR: https://github.com/apache/iceberg/pull/15341
Documentation, including findings and the proposed solution: https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
On Thu, Feb 12, 2026 at 8:34 PM Steve Loughran <[email protected]> wrote:

> for an object store, overlapping the GET of the next file with the
> processing of the first would maximise CPU use; there'd be no conflicting
> demand for the core, just an http request issued and awaiting a response on
> one thread while the main CPU carries on its work.
>
> calling InputFile.newStream() async would be enough to start, though any
> cloud connector doing lazy GET calls would be postponing any/all IO until
> the first reads take place...
>
> On Thu, 12 Feb 2026 at 07:14, Varun Lakhyani <[email protected]> wrote:
>
>> Hello All,
>>
>> I'd like to start a discussion around adding asynchronous capability to
>> Spark readers by making them capable of running parallel tasks, especially
>> when large numbers of small files are involved.
>> Currently, readers are based on BaseReader.next(), where each task is
>> opened, fully consumed, and closed before moving on to the next one.
>>
>> With workloads containing hundreds or thousands of small files (for
>> example, 4–10 KB files), this sequential behavior can introduce significant
>> overhead. Each file is opened independently, and the reader waits for one
>> task to be fully consumed before opening the next, so CPU idleness can
>> also be a major issue.
>>
>> One possible improvement is to optionally allow Spark readers to operate
>> asynchronously for scans dominated by many small files.
>> At a high level, the idea would be to:
>>
>> - Open multiple small-file scan tasks concurrently, read from them
>> asynchronously or in parallel, and stitch their output into a single
>> buffered iterator or stream for downstream processing
>>
>> The existing sequential behavior would remain the default, with this mode
>> being opt-in or conditionally enabled for small-file-heavy workloads.
>> This could benefit several Iceberg use cases, including compaction and
>> cleanup jobs.
>>
>> *My Questions*
>>
>> - Are there known constraints in Spark's task execution model that
>> would make this approach problematic?
>> - Is it suitable for me to draft a proposal for this idea and work on it?
>>
>> I've opened a related issue [1] to capture the problem statement and
>> initial thoughts.
>> Any feedback, pointers to prior discussions, or guidance would be very
>> helpful.
>>
>> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
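
For illustration, below is a minimal, library-agnostic Java sketch of the "stitch into a single buffered iterator" idea quoted above. It is only a sketch under assumptions: PrefetchingTaskIterator, readTask, and maxInFlight are hypothetical names, not existing Iceberg or Spark APIs, and each task's rows are materialized in memory, which is only reasonable for the few-KB files discussed in this thread. The actual design is in the linked PR and document.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Stitches many small scan tasks into a single iterator while keeping a
// bounded number of tasks open/being read ahead of the consumer. Rows still
// come out in task order, so the downstream contract is unchanged; only the
// open + read of upcoming tasks overlaps with consumption of the current one.
class PrefetchingTaskIterator<T, R> implements Iterator<R> {
  private final Iterator<T> tasks;              // planned scan tasks (e.g. small files)
  private final Function<T, List<R>> readTask;  // opens, fully reads, and closes one task
  private final ExecutorService pool;
  private final int maxInFlight;                // bound on concurrently open tasks
  private final ArrayDeque<CompletableFuture<List<R>>> inFlight = new ArrayDeque<>();
  private Iterator<R> current = Collections.emptyIterator();

  PrefetchingTaskIterator(
      Iterator<T> tasks, Function<T, List<R>> readTask, ExecutorService pool, int maxInFlight) {
    this.tasks = tasks;
    this.readTask = readTask;
    this.pool = pool;
    this.maxInFlight = maxInFlight;
    fill();
  }

  // Keep up to maxInFlight tasks being opened and read in the background.
  private void fill() {
    while (inFlight.size() < maxInFlight && tasks.hasNext()) {
      T task = tasks.next();
      inFlight.add(CompletableFuture.supplyAsync(() -> readTask.apply(task), pool));
    }
  }

  @Override
  public boolean hasNext() {
    // Block only on the oldest in-flight task, then immediately refill.
    while (!current.hasNext() && !inFlight.isEmpty()) {
      current = inFlight.poll().join().iterator();
      fill();
    }
    return current.hasNext();
  }

  @Override
  public R next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}

A variant closer to Steve's suggestion would prefetch only InputFile.newStream() for the next tasks instead of fully decoded rows, keeping per-task streaming while still overlapping the object-store GET with processing. A real implementation would also need error propagation and stream cleanup when the consumer stops early.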
