[GitHub] [spark] mridulm commented on a change in pull request #32401: [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file

GitBox Mon, 12 Jul 2021 18:13:23 -0700


mridulm commented on a change in pull request #32401:
URL: https://github.com/apache/spark/pull/32401#discussion_r668356459




##########
File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
##########
@@ -360,13 +389,41 @@ private[spark] class IndexShuffleBlockResolver(
           if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
             throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
           }
+
+          // write the checksum file
+          checksumTmpOpt.zip(checksumFileOpt).foreach { case (checksumTmp, checksumFile) =>
+            val out = new DataOutputStream(
+              new BufferedOutputStream(
+                new FileOutputStream(checksumTmp)
+              )
+            )
+            Utils.tryWithSafeFinally {
+              checksums.foreach(out.writeLong)
+            } {
+              out.close()
+            }
+
+            if (checksumFile.exists()) {
+              checksumFile.delete()
+            }
+            if (!checksumTmp.renameTo(checksumFile)) {
+              // It's not worthwhile to fail here after index file and data file are already
+              // successfully stored due to checksum is only used for the corner error case.
+              logWarning("fail to rename file " + checksumTmp + " to " + checksumFile)

Review comment:
   I agree, and I am fine with this behavior here.
   I was wondering whether we should do the same [above](https://github.com/apache/spark/pull/32401/files#diff-9e9749da596e4dd6c02722f91cd62afc28a44f00c7cebb927ccdeae1629e98a1R354) as well?
   That is, if index/data exist but the checksum file does not, do we want to rewrite index/data just to populate the checksum?
   Or should we simply skip writing the checksum when it is missing and behave as we do here?
   Thoughts?
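
   The two options in this question can be sketched as a small decision function. This is only an illustration and not code from the PR: `CommitAction`, `KeepExisting`, `WriteAll`, `decide`, and the `rewriteForChecksum` flag are hypothetical names, and the sketch assumes the resolver already knows whether the existing index/data pair is valid and whether a checksum file is present.

```scala
object ChecksumCommitSketch {
  // Hypothetical outcome of a commit attempt (names are illustrative only).
  sealed trait CommitAction
  // Reuse the already committed index/data and discard this attempt's temp files.
  case object KeepExisting extends CommitAction
  // Commit this attempt's index, data, and checksum files.
  case object WriteAll extends CommitAction

  // indexDataValid:     an earlier attempt already committed a consistent index/data pair
  // checksumExists:     a checksum file is already present next to them
  // rewriteForChecksum: true  = rewrite index/data just to populate the checksum
  //                     false = treat the checksum as best effort and leave it missing
  def decide(
      indexDataValid: Boolean,
      checksumExists: Boolean,
      rewriteForChecksum: Boolean): CommitAction = {
    if (!indexDataValid) {
      WriteAll // nothing usable on disk yet, so this attempt commits everything
    } else if (!checksumExists && rewriteForChecksum) {
      WriteAll // first option: redo index/data so the checksum matches them
    } else {
      KeepExisting // second option: a missing checksum is tolerated
    }
  }
}
```

   With `rewriteForChecksum = false`, a missing checksum never forces a rewrite, which matches the warn-and-continue handling of the failed rename in the hunk above.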




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



