I have made an initial implementation based on one such approach.
GitHub PR: https://github.com/apache/iceberg/pull/15341
Documentation, including findings and the proposed solution: https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing
On Thu, Feb 12, 2026 at 8:34 PM Steve Loughran <[email protected]> wrote:

> for an object store, overlapping the GET of the next file with the
> processing of the first would maximise CPU use; there'd be no conflicting
> demand for the core, just an http request issued and awaiting a response on
> one thread while the main CPU carries on its work.
>
> calling InputFile.newStream() async would be enough to start, though any
> cloud connector doing lazy GET calls would be postponing any/all IO until
> the first reads take place...
>
> On Thu, 12 Feb 2026 at 07:14, Varun Lakhyani <[email protected]> wrote:
>
>> Hello All,
>>
>> I'd like to start a discussion around adding asynchronous capability to
>> Spark readers by making them capable of running parallel tasks, especially
>> when large numbers of small files are involved.
>> Currently, readers are based on BaseReader.next(), where each task is
>> opened, fully consumed, and closed before moving on to the next one.
>>
>> With workloads containing hundreds or thousands of small files (for
>> example, 4–10 KB files), this sequential behavior can introduce significant
>> overhead. Each file is opened independently, and the reader waits for one
>> task to be fully consumed before opening the next, so CPU idleness can
>> also be a major issue.
>>
>> One possible improvement is to optionally allow Spark readers to operate
>> asynchronously for scans dominated by many small files.
>> At a high level, the idea would be to:
>>
>> - Open multiple small-file scan tasks concurrently, read from them
>> asynchronously or in parallel, and stitch their output into a single
>> buffered iterator or stream for downstream processing
>>
>> The existing sequential behavior would remain the default, with this mode
>> being opt-in or conditionally enabled for small-file-heavy workloads.
>> This could benefit several Iceberg use cases, including compaction and
>> cleanup jobs.
>>
>> *My Questions*
>>
>> - Are there known constraints in Spark's task execution model that
>> would make this approach problematic?
>> - Is it suitable for me to draft a proposal for this idea and work on it?
>>
>> I've opened a related issue [1] to capture the problem statement and
>> initial thoughts.
>> Any feedback, pointers to prior discussions, or guidance would be very
>> helpful.
>>
>> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
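
For illustration, below is a minimal, library-agnostic Java sketch of the "stitch into a single buffered iterator" idea quoted above. It is only a sketch under assumptions: PrefetchingTaskIterator, readTask, and maxInFlight are hypothetical names, not existing Iceberg or Spark APIs, and each task's rows are materialized in memory, which is only reasonable for the few-KB files discussed in this thread. The actual design is in the linked PR and document.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Stitches many small scan tasks into a single iterator while keeping a
// bounded number of tasks open/being read ahead of the consumer. Rows still
// come out in task order, so the downstream contract is unchanged; only the
// open + read of upcoming tasks overlaps with consumption of the current one.
class PrefetchingTaskIterator<T, R> implements Iterator<R> {
  private final Iterator<T> tasks;              // planned scan tasks (e.g. small files)
  private final Function<T, List<R>> readTask;  // opens, fully reads, and closes one task
  private final ExecutorService pool;
  private final int maxInFlight;                // bound on concurrently open tasks
  private final ArrayDeque<CompletableFuture<List<R>>> inFlight = new ArrayDeque<>();
  private Iterator<R> current = Collections.emptyIterator();

  PrefetchingTaskIterator(
      Iterator<T> tasks, Function<T, List<R>> readTask, ExecutorService pool, int maxInFlight) {
    this.tasks = tasks;
    this.readTask = readTask;
    this.pool = pool;
    this.maxInFlight = maxInFlight;
    fill();
  }

  // Keep up to maxInFlight tasks being opened and read in the background.
  private void fill() {
    while (inFlight.size() < maxInFlight && tasks.hasNext()) {
      T task = tasks.next();
      inFlight.add(CompletableFuture.supplyAsync(() -> readTask.apply(task), pool));
    }
  }

  @Override
  public boolean hasNext() {
    // Block only on the oldest in-flight task, then immediately refill.
    while (!current.hasNext() && !inFlight.isEmpty()) {
      current = inFlight.poll().join().iterator();
      fill();
    }
    return current.hasNext();
  }

  @Override
  public R next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}

A variant closer to Steve's suggestion would prefetch only InputFile.newStream() for the next tasks instead of fully decoded rows, keeping per-task streaming while still overlapping the object-store GET with processing. A real implementation would also need error propagation and stream cleanup when the consumer stops early.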
