mridulm commented on code in PR #50230:
URL: https://github.com/apache/spark/pull/50230#discussion_r2328695596
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -875,6 +875,21 @@ object SQLConf {
     .checkValue(_ > 0, "The value of spark.sql.shuffle.partitions must be positive")
     .createWithDefault(200)

+  val SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED =
+    buildConf("spark.sql.shuffle.orderIndependentChecksum.enabled")
+      .doc("Whether to calculate an order independent checksum for the shuffle data or not. " +
+        "If enabled, Spark will calculate a checksum that is independent of the input row " +
+        "order for each mapper and return the checksums from executors to the driver. This " +
+        "is different from the checksum computed when spark.shuffle.checksum.enabled is " +
+        "enabled, which is sensitive to shuffle data ordering so that it can detect file " +
+        "corruption. This checksum stays the same even if the shuffle row order changes, " +
+        "and it is used to detect whether different task attempts of the same partition " +
+        "produce the same set of key-value pairs or not. If the output data has changed " +
+        "across retries, Spark will need to retry all tasks of the consumer stages to " +
+        "avoid correctness issues.")
+      .version("4.1.0")
+      .booleanConf
+      .createWithDefault(false)
+

Review Comment:
Agree, the support for efficient checksum computation does not exist currently - and as I [said here](https://github.com/apache/spark/pull/50230#discussion_r2323262804), I expect that to be a potential follow up.

If you believe this is a prerequisite, we could get the support added (here or as a follow up): for example, a default implementation could simply rely on the existing `hashCode` we use for partitioning (to bootstrap support).

If there is no technical reason why this can't be supported for `RDD` applications, let us move it there instead, so that a wider class of users can benefit from it. I look at this as similar to `spark.shuffle.checksum.enabled`: a way to make shuffle more robust and detect issues proactively.

If there are use cases where it needs to be dynamically overridden from SQL, we can allow for that behavior - the config is only read when the shuffle is initialized, not subsequently (at runtime).

Note: we cannot expect existing RDD users to reformulate their applications using custom RDDs.
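For illustration, a minimal sketch of what a `hashCode`-based, order independent checksum could look like (hypothetical names and logic, not the implementation in this PR): combining per-row hashes with a commutative, associative operation makes the result insensitive to the order in which rows are processed.

```scala
// Hypothetical sketch only. `rowHash` stands in for whatever hash Spark
// already computes for partitioning, as suggested above for bootstrapping.
object OrderIndependentChecksum {
  // Placeholder per-row hash derived from the key/value hashCodes.
  def rowHash(key: Any, value: Any): Long =
    key.hashCode().toLong * 31L + value.hashCode().toLong

  // Long addition (with wrap-around) is commutative and associative, so the
  // fold result does not depend on the order the rows are visited in.
  def checksum(rows: Iterator[(Any, Any)]): Long =
    rows.foldLeft(0L) { case (acc, (k, v)) => acc + rowHash(k, v) }
}

// Any permutation of the same rows yields the same checksum:
// OrderIndependentChecksum.checksum(Iterator("a" -> 1, "b" -> 2)) ==
//   OrderIndependentChecksum.checksum(Iterator("b" -> 2, "a" -> 1))
```

Because the combiner commutes, two task attempts that produce the same set of key-value pairs in different orders agree on the checksum, while a differing row set will almost certainly change it; XOR would work equally well as the combining operation.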