agrawaldevesh commented on a change in pull request #28817:
URL: https://github.com/apache/spark/pull/28817#discussion_r439779723



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -258,26 +262,60 @@ private[spark] class CoarseGrainedExecutorBackend(
     System.exit(code)
   }
 
-  private def decommissionSelf(): Boolean = {
-    logInfo("Decommissioning self w/sync")
-    try {
-      decommissioned = true
-      // Tell master we are decommissioned so it stops trying to schedule us
-      if (driver.nonEmpty) {
-        driver.get.askSync[Boolean](DecommissionExecutor(executorId))
+  private def shutdownIfDone(): Unit = {
+    val numRunningTasks = executor.numRunningTasks
+    logInfo(s"Checking to see if we can shutdown; have ${numRunningTasks} running tasks.")
+    if (executor.numRunningTasks == 0) {
+      if (env.conf.get(STORAGE_DECOMMISSION_ENABLED)) {
+        val allBlocksMigrated = env.blockManager.decommissionManager match {
+          case Some(m) => m.allBlocksMigrated
+          case None => false // We haven't started migrations yet.
+        }
+        if (allBlocksMigrated) {
+          logInfo("No running tasks, all blocks migrated, stopping.")
+          exitExecutor(0, "Finished decommissioning", notifyDriver = true)

Review comment:
       `exitExecutor` asynchronously sends `RemoveExecutor` to the driver. Does that message actually make it to the driver before the process exits? There is also the question of whether we should be using the same `Shutdown`/`StopExecutor` codepath for doing the stopping (although it does seem that we want to signal to the driver that the executor is being removed). The sketch below illustrates the ordering concern.
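
   For concreteness, here is a minimal sketch of the notify-then-exit ordering I am worried about (a paraphrase rather than the exact current code; whether the message is flushed before exit is precisely the assumption in question):

```scala
// Sketch of the race: `send` is a fire-and-forget one-way RPC, so
// nothing guarantees the driver has received RemoveExecutor before
// System.exit tears down the JVM (and with it the RPC transport).
protected def exitExecutor(code: Int, reason: String): Unit = {
  if (driver.nonEmpty) {
    // Asynchronous send; no acknowledgement is awaited.
    driver.get.send(RemoveExecutor(executorId, new ExecutorLossReason(reason)))
  }
  // The process may exit before the message above leaves the wire.
  System.exit(code)
}
```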
   
   Interestingly, the driver does respond with a `StopExecutor` and does trigger the clean shutdown path in the executor, but again I wonder whether that response arrives too late. Perhaps we shouldn't be calling `System.exit` here?
   
   Also, as currently written, this `exitExecutor` could cause job failures: the `TaskSchedulerImpl` will treat the `ExecutorLossReason` sent by the executor to the driver as `exitCausedByApp` and thus count the failure against the task. I don't think we should penalize a running job for a planned executor decommission. One workaround might be to respond to the driver with `ExecutorDecommission` (which is not used anywhere else currently) and then handle that reason specifically in the `TaskSchedulerImpl`'s determination of `exitCausedByApp`, as sketched after this paragraph.
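
   Something along these lines in the loss-reason match (a sketch only; the exact shape of `ExecutorDecommission` and the location of this match, e.g. `TaskSetManager.executorLost`, are assumptions):

```scala
// Sketch: exclude a planned decommission from app-caused failures when
// deciding exitCausedByApp (the surrounding match arms are illustrative).
val exitCausedByApp: Boolean = reason match {
  case exited: ExecutorExited => exited.exitCausedByApp
  case ExecutorKilled => false
  // A planned decommission is not the app's fault, so don't count the
  // lost tasks against the job's failure budget.
  case ExecutorDecommission => false
  case _ => true
}
```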




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


