mridulm commented on a change in pull request #32401:
URL: https://github.com/apache/spark/pull/32401#discussion_r667683511
##########
File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
##########
@@ -360,13 +389,41 @@ private[spark] class IndexShuffleBlockResolver(
       if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
         throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
       }
+
+      // write the checksum file
+      checksumTmpOpt.zip(checksumFileOpt).foreach { case (checksumTmp, checksumFile) =>
+        val out = new DataOutputStream(
+          new BufferedOutputStream(
+            new FileOutputStream(checksumTmp)
+          )
+        )
+        Utils.tryWithSafeFinally {
+          checksums.foreach(out.writeLong)
+        } {
+          out.close()
+        }
+
+        if (checksumFile.exists()) {
+          checksumFile.delete()
+        }
+        if (!checksumTmp.renameTo(checksumFile)) {
+          // It's not worthwhile to fail here after the index and data files have already been
+          // stored successfully, since the checksum is only used for a rare corner error case.
+          logWarning("fail to rename file " + checksumTmp + " to " + checksumFile)
Review comment:
In the `if` condition [above](https://github.com/apache/spark/pull/32401/files#diff-9e9749da596e4dd6c02722f91cd62afc28a44f00c7cebb927ccdeae1629e98a1R354), we rewrite the index/data files even when only the checksum file is missing, yet here a failure to rename the checksum file does not fail the write.
Should we make the two behaviors consistent? A sketch of one option follows.
(If we discussed this earlier, apologies - still catching up on mails.)
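For illustration only, one consistent option is to treat the checksum file as strictly best-effort and route both the write and the rename through a single helper that never throws, so the non-fatal policy is applied uniformly. This is a sketch, not code from the PR; `tryWriteChecksumFile` is a hypothetical name:
```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ChecksumCommitSketch {
  /**
   * Best-effort commit of the checksum file: returns false instead of throwing,
   * so the caller can log a warning without failing the already-committed
   * index/data files. Hypothetical helper, for discussion only.
   */
  public static boolean tryWriteChecksumFile(File tmp, File target, long[] checksums) {
    try (DataOutputStream out =
        new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) {
      for (long checksum : checksums) {
        out.writeLong(checksum);
      }
    } catch (IOException e) {
      return false;
    }
    // Mirror the delete-then-rename dance used for the data file above.
    if (target.exists() && !target.delete()) {
      return false;
    }
    return tmp.renameTo(target);
  }
}
```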
##########
File path: core/src/main/java/org/apache/spark/shuffle/checksum/ShuffleChecksumHelper.java
##########
@@ -0,0 +1,66 @@
+package org.apache.spark.shuffle.checksum;
+
+import java.util.Locale;
+import java.util.zip.Adler32;
+import java.util.zip.CRC32;
+import java.util.zip.Checksum;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.SparkException;
+import org.apache.spark.internal.config.package$;
+import org.apache.spark.storage.ShuffleChecksumBlockId;
+
+public class ShuffleChecksumHelper {
+
+ public static boolean isShuffleChecksumEnabled(SparkConf conf) {
+ return (boolean) conf.get(package$.MODULE$.SHUFFLE_CHECKSUM_ENABLED());
+ }
+
+ public static Checksum[] createPartitionChecksumsIfEnabled(int numPartitions, SparkConf conf)
+ throws SparkException {
+ Checksum[] partitionChecksums;
+
+ if (!isShuffleChecksumEnabled(conf)) {
+ partitionChecksums = new Checksum[0];
Review comment:
nit: Pull this empty array out into a static final field
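A minimal sketch of this nit, assuming a hypothetical field name `EMPTY_CHECKSUMS` and with the real `SparkConf` plumbing replaced by a plain `enabled` flag:
```java
import java.util.zip.Adler32;
import java.util.zip.Checksum;

public class ShuffleChecksumHelperSketch {
  // One shared zero-length array instead of a fresh allocation per call.
  private static final Checksum[] EMPTY_CHECKSUMS = new Checksum[0];

  public static Checksum[] createPartitionChecksumsIfEnabled(int numPartitions, boolean enabled) {
    if (!enabled) {
      return EMPTY_CHECKSUMS;
    }
    Checksum[] checksums = new Checksum[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      checksums[i] = new Adler32();
    }
    return checksums;
  }
}
```
Since a zero-length array is effectively immutable, sharing one instance across callers is safe.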
##########
File path: core/src/main/java/org/apache/spark/shuffle/checksum/ShuffleChecksumHelper.java
##########
@@ -0,0 +1,66 @@
+package org.apache.spark.shuffle.checksum;
+
+import java.util.Locale;
+import java.util.zip.Adler32;
+import java.util.zip.CRC32;
+import java.util.zip.Checksum;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.SparkException;
+import org.apache.spark.internal.config.package$;
+import org.apache.spark.storage.ShuffleChecksumBlockId;
+
+public class ShuffleChecksumHelper {
+
+ public static boolean isShuffleChecksumEnabled(SparkConf conf) {
+ return (boolean) conf.get(package$.MODULE$.SHUFFLE_CHECKSUM_ENABLED());
+ }
+
+ public static Checksum[] createPartitionChecksumsIfEnabled(int numPartitions, SparkConf conf)
+ throws SparkException {
+ Checksum[] partitionChecksums;
+
+ if (!isShuffleChecksumEnabled(conf)) {
+ partitionChecksums = new Checksum[0];
+ return partitionChecksums;
+ }
+
+ String checksumAlgo = shuffleChecksumAlgorithm(conf).toLowerCase(Locale.ROOT);
+ switch (checksumAlgo) {
+ case "adler32":
+ partitionChecksums = new Adler32[numPartitions];
+ for (int i = 0; i < numPartitions; i++) {
+ partitionChecksums[i] = new Adler32();
+ }
+ return partitionChecksums;
+
+ case "crc32":
+ partitionChecksums = new CRC32[numPartitions];
+ for (int i = 0; i < numPartitions; i++) {
+ partitionChecksums[i] = new CRC32();
+ }
+ return partitionChecksums;
+
+ default:
+ throw new SparkException("Unsupported shuffle checksum algorithm: " + checksumAlgo);
+ }
+ }
+
+ public static long[] getChecksumValues(Checksum[] partitionChecksums) {
Review comment:
nit: short-circuit the common case of `0 == partitionChecksums.length` by returning an empty long array?
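A sketch of that short circuit, reusing the same static-final-field idea as above (the constant name `EMPTY_LONGS` is illustrative):
```java
import java.util.zip.Checksum;

public class GetChecksumValuesSketch {
  private static final long[] EMPTY_LONGS = new long[0];

  public static long[] getChecksumValues(Checksum[] partitionChecksums) {
    // Checksums are disabled in the common case, so skip the allocation entirely.
    if (partitionChecksums.length == 0) {
      return EMPTY_LONGS;
    }
    long[] values = new long[partitionChecksums.length];
    for (int i = 0; i < partitionChecksums.length; i++) {
      values[i] = partitionChecksums[i].getValue();
    }
    return values;
  }
}
```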
##########
File path: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
##########
@@ -129,7 +135,7 @@ public void write(Iterator<Product2<K, V>> records) throws IOException {
         .createMapOutputWriter(shuffleId, mapId, numPartitions);
     try {
       if (!records.hasNext()) {
-        partitionLengths = mapOutputWriter.commitAllPartitions().getPartitionLengths();
+        partitionLengths = mapOutputWriter.commitAllPartitions(new long[0]).getPartitionLengths();
Review comment:
nit: Pull this empty array out into a static final field? (Here and elsewhere below; see the call-site sketch that follows.)
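At the call sites, the shared constant would then replace the inline allocation. Purely illustrative, assuming a hypothetical `ShuffleChecksumHelper.EMPTY_CHECKSUM_VALUES` constant of type `long[]`:
```java
// Hypothetical call-site form; EMPTY_CHECKSUM_VALUES is an assumed name, not in the PR.
partitionLengths = mapOutputWriter
    .commitAllPartitions(ShuffleChecksumHelper.EMPTY_CHECKSUM_VALUES)
    .getPartitionLengths();
```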
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]