I have added a separate test for this and benchmarked it against the existing
synchronous Spark readers for compaction (rewrite_data_files).

Benchmarking details:

To simulate the real I/O overhead of cloud storage, I added a small latency
using LockSupport.parkNanos(1_000_000) in the open() method of
org/apache/iceberg/spark/source/BatchDataReader.java. (Benchmarks used
@Warmup(iterations = 5) and @Measurement(iterations = 15).)
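For readers unfamiliar with the trick, the latency injection described above amounts to parking the thread for ~1 ms on every per-file open. A minimal standalone sketch (not the actual Iceberg patch; the class and method names here are hypothetical):

```java
import java.util.concurrent.locks.LockSupport;

// Sketch of the latency injection described above: parking the thread for
// ~1 ms inside the per-file open path simulates the round-trip latency of
// opening an object on cloud storage.
public class SimulatedIo {
    static final long OPEN_LATENCY_NANOS = 1_000_000L; // 1 ms, as in the benchmark

    static String openFile(String path) {
        LockSupport.parkNanos(OPEN_LATENCY_NANOS); // simulated cloud-storage round trip
        return "contents-of-" + path;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            openFile("file-" + i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("10 sequential opens took ~" + elapsedMs + " ms");
    }
}
```

With a sequential reader the injected latency adds up linearly (N files pay N parks), which is exactly the effect the benchmark magnifies.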

Results for compacting 1000 files (15-20 KB each) with rewrite_data_files,
for various injected overheads:

Overhead (ms)        Async (s)   Sync (s, existing)   % Improvement
No manual overhead   0.765       0.932                17.9%
1                    0.772       2.881                73.2%
5                    1.778       8.512                79.1%
10                   3.284       15.159               78.3%
15                   4.709       21.260               77.8%

Detailed results for 100, 500, and 1000 files across all overheads are in
the reference design document:
https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing

The high-level design/POC is complete on my end, and I would welcome any
feedback, suggestions, or review from the community to take this further.

On Thu, Feb 12, 2026 at 12:44 PM Varun Lakhyani <[email protected]>
wrote:

> Hello All,
>
> I’d like to start a discussion around adding asynchronous capability to
> Spark readers by making them capable of running tasks in parallel,
> especially when large numbers of small files are involved.
> Currently, readers are based on BaseReader.next(), where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4–10 KB files), this sequential behavior can introduce significant
> overhead. Each file is opened independently, and the reader waits for one
> task to be fully consumed before opening the next. CPU idleness can also
> become a major issue here.
>
> One possible improvement is to optionally allow Spark readers to function
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
>    - Open multiple small-file scan tasks concurrently
>    - Read from them asynchronously or in parallel
>    - Stitch their output into a single buffered iterator or stream for
>    downstream processing
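The stitching idea above could be sketched with standard java.util.concurrent primitives: a fixed pool opens and drains tasks, rows flow through a bounded queue (which also provides backpressure), and a single iterator is handed downstream. This is only an illustration of the pattern, not the proposed Iceberg API; AsyncStitchReader and its signatures are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: run many small "scan tasks" concurrently and stitch their rows
// into one buffered iterator for downstream consumption.
public class AsyncStitchReader {
    private static final Object EOF = new Object(); // sentinel: all tasks finished

    static Iterator<Object> readAll(List<Callable<List<Object>>> tasks, int parallelism) {
        BlockingQueue<Object> buffer = new LinkedBlockingQueue<>(1024); // bounded: backpressure
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        CountDownLatch done = new CountDownLatch(tasks.size());
        for (Callable<List<Object>> task : tasks) {
            pool.submit(() -> {
                try {
                    for (Object row : task.call()) {
                        buffer.put(row); // blocks when the buffer is full
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                } finally {
                    done.countDown();
                }
            });
        }
        // Mark end-of-stream once every task has completed.
        new Thread(() -> {
            try {
                done.await();
                buffer.put(EOF);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }).start();
        pool.shutdown();
        return new Iterator<Object>() {
            private Object next;

            @Override
            public boolean hasNext() {
                if (next == null) {
                    try {
                        next = buffer.take();
                    } catch (InterruptedException e) {
                        return false;
                    }
                }
                return next != EOF;
            }

            @Override
            public Object next() {
                Object row = next;
                next = null;
                return row;
            }
        };
    }
}
```

Note this sketch does not preserve per-file row order across tasks; whether ordering matters would need to be settled in the actual proposal.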
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction or
> cleanup jobs.
>
> *My Question*
>
>    - Are there known constraints in Spark’s task execution model that
>    would make this approach problematic?
>    - Would it be reasonable for me to draft a proposal for this idea and
>    work on it?
>
> I’ve opened a related issue [1] to capture the problem statement and
> initial thoughts.
> Any feedback, pointers to prior discussions, or guidance would be very
> helpful.
>
> [1] Github issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
>
>
