Hi,

While running some tests for an article last weekend, I came across the new
ChecksumCheckpointFileManager feature in Spark 4.1.

This feature spawns *4 threads per output partition* in streaming jobs.
Since streaming jobs also use the *default 200 shuffle partitions*, that
already means 800 extra threads (4 × 200) per query, a significant
resource risk. In fact, I tested it by increasing
spark.sql.shuffle.partitions to 1000 in a simple streaming job and ran into
an *OOM error*, likely due to the creation of 4,000 extra threads (4 threads
× 1000 partitions).
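
For context, my test was essentially a job like the sketch below (the rate
source, grouping key, and checkpoint path are illustrative placeholders,
not my exact code):

    import org.apache.spark.sql.SparkSession

    object ShufflePartitionRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ShufflePartitionRepro")
          .master("local[4]")
          // 1000 shuffle partitions -> 1000 state/output partitions per micro-batch
          .config("spark.sql.shuffle.partitions", "1000")
          .getOrCreate()
        import spark.implicits._

        // A stateful aggregation forces a shuffle, so the query writes
        // checkpoint/state files for every shuffle partition.
        val counts = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "100")
          .load()
          .groupBy($"value" % 10)
          .count()

        counts.writeStream
          .outputMode("complete")
          .format("console")
          .option("checkpointLocation", "/tmp/checksum-repro")
          .start()
          .awaitTermination()
      }
    }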

I’ve opened *SPARK-55311
<https://issues.apache.org/jira/browse/SPARK-55311>* regarding this. My
suggestion is that, just as *AQE is disabled in streaming jobs*, we might
consider *defaulting to a lower spark.sql.shuffle.partitions value* (e.g.,
25) for streaming workloads.
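
In the meantime, affected jobs can of course set the value explicitly; a
minimal example (using the 25 from the suggestion above):

    // Set before the query's first start: for stateful streaming queries,
    // the partition count recorded in the checkpoint is reused on restart.
    spark.conf.set("spark.sql.shuffle.partitions", "25")

The same can be passed at submit time via
--conf spark.sql.shuffle.partitions=25.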

I’d love to hear your thoughts on this.

Regards,
Ángel Álvarez
