mridulm commented on a change in pull request #32401:
URL: https://github.com/apache/spark/pull/32401#discussion_r654799386
##########
File path:
core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
##########
@@ -120,6 +124,13 @@
this.writeMetrics = writeMetrics;
this.serializer = dep.serializer();
this.shuffleExecutorComponents = shuffleExecutorComponents;
+ if ((boolean) conf.get(package$.MODULE$.SHUFFLE_CHECKSUM())) {
+ this.checksumEnabled = true;
+ this.partitionChecksums = new Adler32[numPartitions];
+ for (int i = 0; i < numPartitions; i ++) {
+ this.partitionChecksums[i] = new Adler32();
Review comment:
> Anyways, so if we do want the algorithm to be configurable, can we
leverage the RegisterExecutor message for it?
The reason I was initially proposing adding checksum algo to the file name
itself as a suffix is to minimize the state required to reason about which
algorithm is being used. We wont need to pass it from container to ESS, or
persist it across ESS restarts, etc.
Tom's additional suggestion of including it in checksum file itself as
metadata also works - given the current index'ing into the file for a given
partition (8 * partition_id), metadata at end of file might be more convenient
place to record this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]