holdenk commented on a change in pull request #28331:
URL: https://github.com/apache/spark/pull/28331#discussion_r430563960
##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionSuite.scala
##########
@@ -69,36 +84,64 @@ class BlockManagerDecommissionSuite extends SparkFunSuite with LocalSparkContext
})
// Cache the RDD lazily
- sleepyRdd.persist()
+ if (persist) {
+ testRdd.persist()
+ }
// Start the computation of RDD - this step will also cache the RDD
- val asyncCount = sleepyRdd.countAsync()
+ val asyncCount = testRdd.countAsync()
// Wait for the job to have started
sem.acquire(1)
+ // Give Spark a tiny bit to start the tasks after the listener says hello
+ Thread.sleep(100)
Review comment:
So this wait is for the task to be properly scheduled, not for the number
of execs. The assert for the number of execs is something I think we should
keep, because we want to make sure that decommissioning isn't the same as exiting.
##########
File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
##########
@@ -148,6 +170,86 @@ private[spark] class IndexShuffleBlockResolver(
}
}
+ /**
+ * Write a provided shuffle block as a stream. Used for block migrations.
+ * ShuffleBlockBatchIds must contain the full range represented in the ShuffleIndexBlock.
+ * Requires the caller to delete any shuffle index blocks where the shuffle block fails to put.
+ */
+ def putShuffleBlockAsStream(blockId: BlockId, serializerManager: SerializerManager):
+ StreamCallbackWithID = {
+ val file = blockId match {
+ case ShuffleIndexBlockId(shuffleId, mapId, _) =>
+ getIndexFile(shuffleId, mapId)
+ case ShuffleBlockBatchId(shuffleId, mapId, _, _) =>
+ getDataFile(shuffleId, mapId)
+ case _ =>
+ throw new Exception(s"Unexpected shuffle block transfer ${blockId}")
+ }
+ val fileTmp = Utils.tempFileWith(file)
+ val channel = Channels.newChannel(
+ serializerManager.wrapStream(blockId,
+ new FileOutputStream(fileTmp)))
+
+ new StreamCallbackWithID {
+
+ override def getID: String = blockId.name
+
+ override def onData(streamId: String, buf: ByteBuffer): Unit = {
+ while (buf.hasRemaining) {
+ channel.write(buf)
+ }
+ }
+
+ override def onComplete(streamId: String): Unit = {
+ logTrace(s"Done receiving block $blockId, now putting into local
shuffle service")
+ channel.close()
+ val diskSize = fileTmp.length()
+ this.synchronized {
+ if (file.exists()) {
+ file.delete()
+ }
Review comment:
So this mirrors the logic inside of writeIndexFileAndCommit; the matching check
there was introduced in SPARK-17547, which I believe is for the situation where an
exception occurred during a previous write and the filesystem is in a dirty state.
So I think we should keep it to be safe.
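To spell out the pattern being preserved, here is a minimal stand-alone sketch (the `CommitSketch` object and `writeAndCommit` helper are hypothetical, not the resolver's actual code): write into a temp file, then delete any stale target left behind by a crashed previous attempt before renaming the temp file into place.

```scala
import java.io.{File, FileOutputStream, IOException}

// Hypothetical helper illustrating the write-to-temp-then-commit pattern,
// not the actual IndexShuffleBlockResolver code.
object CommitSketch {
  def writeAndCommit(target: File, bytes: Array[Byte]): Unit = {
    val tmp = File.createTempFile(target.getName, ".tmp", target.getParentFile)
    val out = new FileOutputStream(tmp)
    try {
      out.write(bytes)
    } finally {
      out.close()
    }
    synchronized {
      // A write that died part-way through can leave a partial target on disk
      // (the SPARK-17547 scenario), so remove it before renaming.
      if (target.exists() && !target.delete()) {
        throw new IOException(s"Failed to delete stale file ${target.getAbsolutePath}")
      }
      if (!tmp.renameTo(target)) {
        throw new IOException(s"Failed to rename ${tmp.getAbsolutePath} to $target")
      }
    }
  }
}
```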
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1777,7 +1799,7 @@ private[spark] class BlockManager(
def decommissionBlockManager(): Unit = {
if (!blockManagerDecommissioning) {
- logInfo("Starting block manager decommissioning process")
+ logInfo("Starting block manager decommissioning process...")
blockManagerDecommissioning = true
decommissionManager = Some(new BlockManagerDecommissionManager(conf))
decommissionManager.foreach(_.start())
Review comment:
Yeah, this was added in the cache block migration PR (now merged).