Hi,

While running some tests for an article last weekend, I came across the new ChecksumCheckpointFileManager feature in Spark 4.1.
This feature spawns *4 threads per output partition* in streaming jobs. Since streaming jobs also use the *default of 200 shuffle partitions*, this could pose a significant resource risk. In fact, I tested it by increasing spark.sql.shuffle.partitions to 1000 in a simple streaming job and ran into an *OOM error*, likely caused by the 4,000 extra threads (4 threads × 1000 partitions). A minimal sketch of the kind of job I used is below.

I've opened *SPARK-55311 <https://issues.apache.org/jira/browse/SPARK-55311>* to track this.

My suggestion is that, similar to how *AQE is disabled in streaming jobs*, we might consider *defaulting to a lower spark.sql.shuffle.partitions value* (e.g., 25) for streaming workloads.

I'd love to hear your thoughts on this.
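For reference, a minimal sketch of the kind of job I used (the rate source, checkpoint path, and query shape are illustrative; the relevant part is raising spark.sql.shuffle.partitions above the default):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ShufflePartitionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checksum-checkpoint-thread-repro")
      .master("local[4]")
      // Raised from the default 200; this is the setting that triggered the OOM.
      .config("spark.sql.shuffle.partitions", "1000")
      .getOrCreate()

    import spark.implicits._

    // A stateful aggregation forces a shuffle, so state is checkpointed
    // per shuffle partition.
    val counts = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()
      .groupBy($"value" % 10)
      .count()

    val query = counts.writeStream
      .format("console")
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/checksum-repro-checkpoint")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}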
Regards,
Ángel Álvarez