Hello all,

I'd like to start a discussion around adding an asynchronous capability to the Spark readers, so that they can process multiple scan tasks in parallel, especially when a scan involves a large number of small files. Currently, the readers are driven by BaseReader.next(), where each task is opened, fully consumed, and closed before the next one is started.
With workloads containing hundreds or thousands of small files (for example, 4–10 KB each), this sequential behavior can introduce significant overhead: each file is opened independently, and the reader waits for one task to be fully consumed before opening the next, so CPU idle time can also become a major issue.

One possible improvement is to optionally allow the Spark readers to operate asynchronously for scans dominated by many small files. At a high level, the idea would be to:
- Open multiple small-file scan tasks concurrently
- Read from them asynchronously or in parallel
- Stitch their output into a single buffered iterator or stream for downstream processing (a rough sketch is included at the end of this mail)

The existing sequential behavior would remain the default, with the new mode being opt-in or conditionally enabled for small-file-heavy workloads. This could benefit several Iceberg use cases, including compaction and cleanup jobs.

*My questions*
- Are there known constraints in Spark's task execution model that would make this approach problematic?
- Would it be reasonable for me to draft a proposal for this idea and work on it?

I've opened a related issue [1] to capture the problem statement and initial thoughts. Any feedback, pointers to prior discussions, or other guidance would be very helpful.

[1] GitHub issue - https://github.com/apache/iceberg/issues/15287

--
Lakhyani Varun
Indian Institute of Technology Roorkee
Contact: +91 96246 46174
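
P.S. For concreteness, below is a minimal, untested sketch of the kind of buffered asynchronous merging I have in mind. It is plain Java rather than real Iceberg code: the class name AsyncMergedIterator, the Supplier-based task openers, and the parallelism/buffer parameters are hypothetical and purely illustrative; an actual implementation would sit behind BaseReader and keep its resource management, metrics, and error handling.

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Illustrative sketch only: opens several per-file iterators on a small thread
 * pool, drains them concurrently into a bounded queue, and exposes the combined
 * output as a single iterator. Ordering across tasks is not preserved, and
 * producer failures are swallowed here; a real implementation would surface
 * them to the caller.
 */
public class AsyncMergedIterator<T> implements Iterator<T> {
  private static final Object DONE = new Object();   // sentinel marking that all producers finished

  private final BlockingQueue<Object> buffer;        // bounded queue gives natural back-pressure
  private final AtomicInteger remainingProducers;
  private final ExecutorService pool;
  private Object next;                               // look-ahead element for hasNext()

  public AsyncMergedIterator(List<Supplier<Iterator<T>>> taskOpeners, int parallelism, int bufferSize) {
    this.buffer = new ArrayBlockingQueue<>(bufferSize);
    this.remainingProducers = new AtomicInteger(taskOpeners.size());
    this.pool = Executors.newFixedThreadPool(parallelism);

    if (taskOpeners.isEmpty()) {
      buffer.add(DONE);                              // nothing to read
    }

    for (Supplier<Iterator<T>> opener : taskOpeners) {
      pool.submit(() -> {
        try {
          Iterator<T> rows = opener.get();           // open one small-file scan task
          while (rows.hasNext()) {
            buffer.put(rows.next());                 // blocks when the buffer is full
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          if (remainingProducers.decrementAndGet() == 0) {
            try {
              buffer.put(DONE);                      // last producer signals completion
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }
      });
    }
    pool.shutdown();                                 // no new tasks; submitted ones keep running
  }

  @Override
  public boolean hasNext() {
    if (next == null) {
      try {
        next = buffer.take();                        // single consumer assumed
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return next != DONE;
  }

  @Override
  @SuppressWarnings("unchecked")
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = (T) next;
    next = null;
    return result;
  }
}

Each Supplier above stands in for whatever currently opens a single scan task inside BaseReader. The sketch assumes a single consumer thread, does not preserve row ordering across tasks, and omits error propagation and cancellation; those are exactly the areas where I would appreciate guidance.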
