mridulm commented on code in PR #50230:
URL: https://github.com/apache/spark/pull/50230#discussion_r2328695596


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -875,6 +875,21 @@ object SQLConf {
     .checkValue(_ > 0, "The value of spark.sql.shuffle.partitions must be positive")
     .createWithDefault(200)
 
+  val SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED =
+    buildConf("spark.sql.shuffle.orderIndependentChecksum.enabled")
+      .doc("Whether to calculate an order-independent checksum for the shuffle data. If " +
+        "enabled, Spark will calculate a checksum that is independent of the input row order " +
+        "for each mapper and return the checksums from the executors to the driver. This is " +
+        "different from the checksum computed when spark.shuffle.checksum.enabled is enabled, " +
+        "which is sensitive to shuffle data ordering in order to detect file corruption. This " +
+        "checksum stays the same even if the shuffle row order changes, and it is used to " +
+        "detect whether different task attempts of the same partition produce different output " +
+        "data (i.e. a different set of key-value pairs). If the output data changes across " +
+        "retries, Spark will need to retry all tasks of the consumer stages to avoid " +
+        "correctness issues.")
+      .version("4.1.0")
+      .booleanConf
+      .createWithDefault(false)
+
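
For context (illustrative only, assuming the config is merged under the name shown above and that `spark` is an active `SparkSession`), the flag would be toggled like any other SQL conf:

```scala
// Enable the proposed order-independent shuffle checksum on a SparkSession.
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "true")
```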

Review Comment:
   Agree, support for efficient checksum computation does not exist currently - and as I [said here](https://github.com/apache/spark/pull/50230#discussion_r2323262804), I expect that to be a potential follow-up.
   If you believe this is a prerequisite, we could get the support added (here or as a follow-up): for example, a default implementation could simply rely on the existing `hashCode` we use for partitioning (to bootstrap support) - see the sketch below.
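   
   A minimal sketch of what such a default could look like (the class and method names here are hypothetical, purely to illustrate combining the per-record `hashCode` with a commutative operation so the result is independent of row order):
   ```scala
   // Hypothetical sketch: accumulate per-record hashes with addition, which is
   // commutative and associative, so the final checksum does not depend on the
   // order in which records are observed.
   class OrderIndependentChecksum {
     private var checksum: Long = 0L
   
     def update(key: Any, value: Any): Unit = {
       // Reuse the same hashCode that is already computed for partitioning.
       val recordHash = 31L * key.hashCode() + value.hashCode()
       checksum += recordHash
     }
   
     def value: Long = checksum
   }
   ```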
   
   If there is no technical reason why this can't be supported for `RDD` applications, let us move it there instead, so that a wider class of users can benefit from it. I look at this as similar to `spark.shuffle.checksum.enabled`: a way to make shuffle more robust and detect issues proactively.
   If there are use cases where it needs to be dynamically overridden from SQL, we can allow for that behavior: the config is only read at shuffle initialization time, and not subsequently (at runtime), as sketched below.
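   
   A rough sketch of that initialization-time read (the config key here is a hypothetical core-side name, assuming this moves out of SQL):
   ```scala
   import org.apache.spark.SparkConf
   
   // Hypothetical: the flag would be read once when the shuffle is set up, so a
   // later runtime change to the conf would not affect an already-created shuffle.
   val conf = new SparkConf()
   val orderIndependentChecksumEnabled: Boolean =
     conf.getBoolean("spark.shuffle.orderIndependentChecksum.enabled", false)
   ```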
   
   Note: we cannot expect existing RDD users to reformulate their applications using custom RDDs.


