agrawaldevesh commented on a change in pull request #28817:
URL: https://github.com/apache/spark/pull/28817#discussion_r439788696



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -258,26 +262,60 @@ private[spark] class CoarseGrainedExecutorBackend(
     System.exit(code)
   }
 
-  private def decommissionSelf(): Boolean = {
-    logInfo("Decommissioning self w/sync")
-    try {
-      decommissioned = true
-      // Tell master we are decommissioned so it stops trying to schedule us
-      if (driver.nonEmpty) {
-        driver.get.askSync[Boolean](DecommissionExecutor(executorId))
+  private def shutdownIfDone(): Unit = {
+    val numRunningTasks = executor.numRunningTasks
+    logInfo(s"Checking to see if we can shutdown: have ${numRunningTasks} running tasks.")
+    if (executor.numRunningTasks == 0) {
+      if (env.conf.get(STORAGE_DECOMMISSION_ENABLED)) {
+        val allBlocksMigrated = env.blockManager.decommissionManager match {
+          case Some(m) => m.allBlocksMigrated
+          case None => false // We haven't started migrations yet.
+        }
+        if (allBlocksMigrated) {
+          logInfo("No running tasks, all blocks migrated, stopping.")
+          exitExecutor(0, "Finished decommissioning", notifyDriver = true)

Review comment:
   I was talking about the case where we get shot down before we had a chance to cleanly exit on line 276. Say, for example, some timeout expires and the executor/node is brought down.
   
   Are `decom.sh` and `decommission-slave.sh` expected to wait until the executor/worker process has properly shut down? I think they have some timeouts in them to kill the executor? Or consider a spot-kill scenario where you get some warning (like 2 minutes) and then the machine is yanked out.
   
   In this case, the executor will eventually be marked as lost via a heartbeat timeout. That loss would be deemed the fault of the task and could cause job failures. I am wondering if we can fix that unclean-exit scenario?
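   To make the attribution concern concrete, here is a tiny self-contained sketch (not actual Spark internals; `TaskState`, `onExecutorLost`, and `maxTaskFailures` are illustrative stand-ins, the last one playing the role of `spark.task.maxFailures`): if an unclean loss is charged to the tasks, a task that was already close to its retry limit can push the whole job into failure.

```scala
object LossAttributionSketch {
  final case class TaskState(taskId: Int, failures: Int = 0)

  // Stand-in for spark.task.maxFailures.
  val maxTaskFailures = 4

  // Decide what happens to the tasks that were running on a lost executor.
  // Returns Left(reason) when the job should abort, Right(tasks) otherwise.
  def onExecutorLost(
      running: Seq[TaskState],
      countAgainstTasks: Boolean): Either[String, Seq[TaskState]] = {
    if (!countAgainstTasks) {
      // The loss is not the tasks' fault (e.g. a decommission): just reschedule.
      Right(running)
    } else {
      // Charge each running task with one failure; abort once any task
      // exhausts its retries.
      val updated = running.map(t => t.copy(failures = t.failures + 1))
      updated.find(_.failures >= maxTaskFailures) match {
        case Some(t) => Left(s"Task ${t.taskId} failed $maxTaskFailures times; aborting job")
        case None    => Right(updated)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val running = Seq(TaskState(taskId = 1, failures = 3), TaskState(taskId = 2))
    println(onExecutorLost(running, countAgainstTasks = true))  // Left(...): job aborts
    println(onExecutorLost(running, countAgainstTasks = false)) // Right(...): tasks just rescheduled
  }
}
```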
   
   One workaround I suggested above was to send a message to the driver saying that the executor is going to go away soon. When the loss then happens (in a clean or an unclean way), it shouldn't be attributed to the task.
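   For illustration, here is a rough sketch of that workaround, assuming a hypothetical `ExecutorDecommissioning` message and driver-side `DecommissionTracker` (neither is the actual Spark API; the existing `DecommissionExecutor` message in this PR may already be a natural hook):

```scala
import scala.collection.mutable

// Hypothetical heads-up from the executor: "I am going away soon".
final case class ExecutorDecommissioning(executorId: String)

// Driver-side bookkeeping: remember which executors announced they are going
// away, so that a later loss of such an executor (clean or unclean) is not
// blamed on the tasks that were running on it.
class DecommissionTracker {
  private val decommissioning = mutable.Set.empty[String]

  // Record the heads-up when the driver receives ExecutorDecommissioning.
  def markDecommissioning(executorId: String): Unit = synchronized {
    decommissioning += executorId
  }

  // Called when the executor is eventually detected as lost (heartbeat
  // timeout, clean exit, spot kill, ...). Returns true only if the loss
  // should still count against the tasks.
  def lossCountsAgainstTasks(executorId: String): Boolean = synchronized {
    !decommissioning.remove(executorId)
  }
}
```

   The point is only the bookkeeping: once the driver has been told that the executor is going away, however the loss eventually happens, it can be exempted from the task-failure accounting.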
   
   Perhaps this unclean executor loss/timeout handling is follow-up work? We (or rather I) can create JIRAs for this under the parent ticket :-).



