cloud-fan commented on code in PR #50230:
URL: https://github.com/apache/spark/pull/50230#discussion_r2328448845


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -875,6 +875,21 @@ object SQLConf {
     .checkValue(_ > 0, "The value of spark.sql.shuffle.partitions must be positive")
     .createWithDefault(200)
 
+  val SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED =
+    buildConf("spark.sql.shuffle.orderIndependentChecksum.enabled")
+      .doc("Whether to calculate an order-independent checksum for the shuffle data. If " +
+        "enabled, Spark will calculate a checksum that is independent of the input row order " +
+        "for each mapper and return the checksums from the executors to the driver. This is " +
+        "different from the checksum computed when spark.shuffle.checksum.enabled is enabled, " +
+        "which is sensitive to the shuffle data ordering and is used to detect file corruption. " +
+        "This checksum stays the same even if the shuffle row order changes, and is used to " +
+        "detect whether different task attempts of the same partition produce the same output " +
+        "data (the same set of key-value pairs). If the output data has changed across " +
+        "retries, Spark will need to retry all tasks of the consumer stages to avoid " +
+        "correctness issues.")
+      .version("4.1.0")
+      .booleanConf
+      .createWithDefault(false)
+

Review Comment:
   Hi @mridulm , I asked to move the conf to SQL because it's only accessed in SQL. I think this feature is better controlled by a dynamic session config rather than a cluster-level `SparkConf` that must be specified before the Spark application starts.
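   To make the contrast concrete, here is a minimal sketch (not code from this PR) of how the two kinds of configs are set; `spark.shuffle.checksum.enabled` is only shown as an example of a static core conf:
   
   ```scala
   import org.apache.spark.SparkConf
   import org.apache.spark.sql.SparkSession
   
   // Cluster-level SparkConf: must be fixed before the application starts.
   val conf = new SparkConf()
     .set("spark.shuffle.checksum.enabled", "true")
   val spark = SparkSession.builder().config(conf).getOrCreate()
   
   // Session config: can be toggled at runtime, per session.
   spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "true")
   ```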
   
   I think it's good practice for Spark Core to provide (internal) APIs and for Spark SQL to decide whether to call them based on session configs.
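   As a purely illustrative sketch of that layering (the Core-side object and method below are made up, not real Spark internals), the SQL side would gate the call on the session config added in this PR:
   
   ```scala
   import org.apache.spark.sql.internal.SQLConf
   
   // Hypothetical Core-side hook; stands in for whatever internal API the PR exposes.
   object CoreShuffleChecksum {
     def enable(): Unit = ()
   }
   
   // SQL-side call site: only invoked when the session config is on.
   def maybeEnableOrderIndependentChecksum(conf: SQLConf): Unit = {
     if (conf.getConf(SQLConf.SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED)) {
       CoreShuffleChecksum.enable()
     }
   }
   ```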
   
   That being said, moving this config definition back to Spark Core does not add RDD support automatically. Given that RDD elements are of a generic type `T`, I think it's better to extend the current RDD APIs to let users supply a custom checksum calculator for their data type `T`, instead of adding a config and having Spark fall back to an inefficient checksum calculator based on the Java serializer.
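   For illustration only (none of this is an existing Spark API), the extended RDD API could accept something like a user-supplied, order-independent checksum calculator for the element type `T`:
   
   ```scala
   // Sketch of a user-supplied checksum calculator; all names here are hypothetical.
   trait OrderIndependentChecksum[T] extends Serializable {
     def checksumOf(element: T): Long
     // XOR keeps the fold independent of element order.
     def combine(a: Long, b: Long): Long = a ^ b
   }
   
   // Example calculator for a concrete user type, avoiding Java serialization.
   case class Event(id: Long, payload: String)
   object EventChecksum extends OrderIndependentChecksum[Event] {
     def checksumOf(e: Event): Long = e.id * 31L + e.payload.hashCode.toLong
   }
   ```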


