Hi Ángel,

Yeah, the ChecksumCheckpointFileManager creates background threads to parallelize checkpoint file uploads from the threads that use the state store, so it only applies to stateful queries (i.e. those using a state store). You can turn it off by setting `spark.sql.streaming.checkpoint.fileChecksum.enabled` to `false` and see how your experiment behaves.
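If it helps, here's a minimal sketch of disabling it for an experiment. The config name is the one above (from Spark 4.1); the app name is just a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("checksum-experiment") // illustrative name
  // Turn off ChecksumCheckpointFileManager's background checksum threads
  // for stateful streaming queries.
  .config("spark.sql.streaming.checkpoint.fileChecksum.enabled", "false")
  .getOrCreate()
```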
Also, for stateful queries, Spark creates one state store instance per partition. If you're using the default store, it's an in-memory map, so as you increase the number of partitions, the number of store instances grows and node resource usage grows with it. In practice you typically won't run that many partitions on a single node; you'd set the count based on the number of cores you have available (there's a quick sketch after the quoted message below). Or do you have a specific use case that requires 1000 state partitions on a single node? But yeah, a lower default might be useful.

On Mon, Feb 2, 2026 at 10:10 PM Ángel Álvarez Pascua <[email protected]> wrote:

> Hi,
>
> While running some tests for an article last weekend, I came across the
> new ChecksumCheckpointFileManager feature in Spark 4.1.
>
> This feature spawns *4 threads per output partition* in streaming jobs.
> Since streaming jobs also use the *default 200 shuffle partitions*, this
> could pose a significant resource risk. In fact, I tested it by increasing
> spark.sql.shuffle.partitions to 1000 in a simple streaming job and ran
> into an *OOM error*—likely due to the creation of 4,000 extra threads (4
> threads × 1000 partitions).
>
> I’ve opened *SPARK-55311
> <https://issues.apache.org/jira/browse/SPARK-55311>* regarding this. My
> suggestion is that, similar to how *AQE is disabled in streaming jobs*,
> we might consider *defaulting to a lower spark.sql.shuffle.partitions
> value* (e.g., 25) for streaming workloads.
>
> I’d love to hear your thoughts on this.
>
> Regards,
> Ángel Álvarez
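Picking up the point above about sizing partitions to cores, here's a rough sketch. The rate source, the modulo key, and the checkpoint path are just stand-ins for a real stateful query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-partition-sizing") // illustrative name
  .getOrCreate()
import spark.implicits._

// One state store instance is created per shuffle partition, so keep the
// partition count close to the actual parallelism instead of the default 200.
val cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", cores.toString)

// A stateful aggregation: one state store per partition at runtime.
val counts = spark.readStream
  .format("rate")                 // built-in test source
  .option("rowsPerSecond", 10)
  .load()
  .groupBy(($"value" % 10).as("bucket"))
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/partition-sizing-ckpt") // illustrative path
  .start()
```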
