cloud-fan commented on code in PR #50230: URL: https://github.com/apache/spark/pull/50230#discussion_r2328448845
########## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -875,6 +875,21 @@ object SQLConf {
     .checkValue(_ > 0, "The value of spark.sql.shuffle.partitions must be positive")
     .createWithDefault(200)

+  val SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED =
+    buildConf("spark.sql.shuffle.orderIndependentChecksum.enabled")
+      .doc("Whether to calculate order independent checksum for the shuffle data or not. If " +
+        "enabled, Spark will calculate a checksum that is independent of the input row order for " +
+        "each mapper and returns the checksums from executors to driver. This is different from " +
+        "the checksum computed when spark.shuffle.checksum.enabled is enabled which is sensitive " +
+        "to shuffle data ordering to detect file corruption. While this checksum will be the " +
+        "same even if the shuffle row order changes and it is used to detect whether different " +
+        "task attempts of the same partition produce different output data or not (same set of " +
+        "keyValue pairs). In case the output data has changed across retries, Spark will need to " +
+        "retry all tasks of the consumer stages to avoid correctness issues.")
+      .version("4.1.0")
+      .booleanConf
+      .createWithDefault(false)
+

Review Comment:
Hi @mridulm, I asked to move the conf to SQL because it is only accessed in SQL. I think this feature is better controlled by a dynamic session config, instead of a cluster-level `SparkConf` that must be specified before the Spark application starts. I think it is good practice to provide (internal) APIs in Spark Core and let Spark SQL decide whether to call them based on session configs.

That being said, moving this config definition back to Spark Core does not add RDD support automatically. Given that RDD elements are of a generic type `T`, I think it is better to extend the current RDD APIs so that users can specify a custom checksum calculator for their data type `T`, instead of adding a config and having Spark provide an inefficient checksum calculator based on the Java serializer.
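To illustrate the session-config point, a minimal sketch (using the standard `SparkSession` runtime-config API; the config key is the one added in this PR) of how the flag could be toggled per session without restarting the application, which a cluster-level `SparkConf` entry cannot do:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checksum-demo").getOrCreate()

    // Enable order-independent shuffle checksums for this session only;
    // no application restart is required.
    spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "true")

    // ... run queries that shuffle ...

    // Turn it off again for subsequent queries in the same session.
    spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "false")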
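For the RDD-side suggestion, a purely hypothetical sketch of what such an extended API could look like; none of these names (`RowChecksum`, `withShuffleChecksum`) exist in Spark today. The only real requirement is that per-element checksums are combined with a commutative, associative operation so the result is independent of row order:

    import org.apache.spark.rdd.RDD

    object ChecksumApiSketch {
      // Hypothetical: a user-supplied, order-independent checksum for elements of type T.
      // `combine` must be commutative and associative (e.g. XOR or addition) so the final
      // value does not depend on the order in which rows are written.
      trait RowChecksum[T] extends Serializable {
        def checksum(value: T): Long
        def combine(a: Long, b: Long): Long
      }

      // Hypothetical extension method: callers supply a calculator for their own type T
      // instead of Spark falling back to a generic, Java-serializer-based default.
      implicit class ChecksummedRDD[T](val rdd: RDD[T]) {
        def withShuffleChecksum(calc: RowChecksum[T]): RDD[T] = {
          // A real implementation would thread `calc` into the shuffle writer;
          // returning the RDD unchanged keeps this sketch self-contained.
          rdd
        }
      }
    }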