[GitHub] [spark] mridulm commented on a change in pull request #32401: [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file

GitBox Sat, 19 Jun 2021 06:56:29 -0700


mridulm commented on a change in pull request #32401:
URL: https://github.com/apache/spark/pull/32401#discussion_r654799386




##########
File path: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
##########
@@ -120,6 +124,13 @@
     this.writeMetrics = writeMetrics;
     this.serializer = dep.serializer();
     this.shuffleExecutorComponents = shuffleExecutorComponents;
+    if ((boolean) conf.get(package$.MODULE$.SHUFFLE_CHECKSUM())) {
+      this.checksumEnabled = true;
+      this.partitionChecksums = new Adler32[numPartitions];
+      for (int i = 0; i < numPartitions; i ++) {
+        this.partitionChecksums[i] = new Adler32();

Review comment:
       > Anyways, so if we do want the algorithm to be configurable, can we 
leverage the RegisterExecutor message for it?
   
   I initially proposed adding the checksum algorithm to the file name itself, 
as a suffix, to minimize the state required to reason about which algorithm is 
in use. We won't need to pass it from the container to the ESS (external 
shuffle service), or persist it across ESS restarts, etc.
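
A minimal sketch of the file-name-suffix idea, under stated assumptions: the name format `shuffle_0_1_0.checksum.ADLER32`, the class `ChecksumSuffixSketch`, and both helper methods are hypothetical and not taken from the PR. It only illustrates that a reader can pick the algorithm from the name alone, with no extra state.

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper, not part of the PR: recovers the checksum
// algorithm from a file-name suffix so no other state is needed.
public class ChecksumSuffixSketch {

  // Returns the text after the last '.' in the file name.
  static String algorithmOf(String fileName) {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0 || dot == fileName.length() - 1) {
      throw new IllegalArgumentException("No algorithm suffix in: " + fileName);
    }
    return fileName.substring(dot + 1);
  }

  // Maps an algorithm name to a java.util.zip implementation.
  // The set of supported names here is an assumption for illustration.
  static Checksum newChecksum(String algorithm) {
    switch (algorithm) {
      case "ADLER32":
        return new Adler32();
      case "CRC32":
        return new CRC32();
      default:
        throw new IllegalArgumentException("Unsupported algorithm: " + algorithm);
    }
  }

  public static void main(String[] args) {
    // Assumed file-name format; only the suffix matters to this sketch.
    String name = "shuffle_0_1_0.checksum.ADLER32";
    Checksum checksum = newChecksum(algorithmOf(name));
    System.out.println(algorithmOf(name) + " -> " + checksum.getClass().getSimpleName());
  }
}
```

With this shape, whichever process reads the file would only need the file name to choose a verifier.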
   
   Tom's additional suggestion of including it in the checksum file itself as 
metadata also works. Given that the file is currently indexed by partition 
(offset 8 * partition_id), the end of the file might be a more convenient 
place to record this metadata, since that leaves the existing offsets 
unchanged.
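
A minimal sketch of the trailing-metadata layout, under stated assumptions: the class `ChecksumFileSketch`, the trailer format (algorithm name in UTF-8 followed by a 4-byte length), and big-endian encoding are all hypothetical choices for illustration, not what the PR implements. The point it shows is that a per-partition lookup at `8 * partition_id` is unaffected by metadata appended at the end.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Hypothetical layout, not taken from the PR:
//   [8 bytes per partition checksum]... [algorithm name, UTF-8] [4-byte name length]
public class ChecksumFileSketch {

  // Serializes the checksums followed by the trailing metadata.
  static byte[] write(long[] checksums, String algorithm) {
    byte[] name = algorithm.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(8 * checksums.length + name.length + 4);
    for (long checksum : checksums) {
      buf.putLong(checksum);
    }
    buf.put(name);
    buf.putInt(name.length);
    return buf.array();
  }

  // Partition lookup stays at offset 8 * partitionId, as in the review comment.
  static long readChecksum(byte[] file, int partitionId) {
    return ByteBuffer.wrap(file).getLong(8 * partitionId);
  }

  // Reads the name length from the last 4 bytes, then the name before it.
  static String readAlgorithm(byte[] file) {
    int nameLength = ByteBuffer.wrap(file).getInt(file.length - 4);
    int nameStart = file.length - 4 - nameLength;
    return new String(file, nameStart, nameLength, StandardCharsets.UTF_8);
  }

  // Computes an Adler-32 checksum over the given bytes.
  static long adler32(byte[] data) {
    Adler32 adler = new Adler32();
    adler.update(data, 0, data.length);
    return adler.getValue();
  }

  public static void main(String[] args) {
    long[] checksums = {
      adler32("partition-0".getBytes(StandardCharsets.UTF_8)),
      adler32("partition-1".getBytes(StandardCharsets.UTF_8))
    };
    byte[] file = write(checksums, "ADLER32");
    System.out.println(readAlgorithm(file) + " " + (readChecksum(file, 1) == checksums[1]));
  }
}
```

Putting the same metadata in a header instead would shift every checksum by the header length, so readers would have to know that length before indexing.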




