agrawaldevesh commented on a change in pull request #28817:
URL: https://github.com/apache/spark/pull/28817#discussion_r439779723
##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -258,26 +262,60 @@ private[spark] class CoarseGrainedExecutorBackend(
     System.exit(code)
   }
-  private def decommissionSelf(): Boolean = {
-    logInfo("Decommissioning self w/sync")
-    try {
-      decommissioned = true
-      // Tell master we are decommissioned so it stops trying to schedule us
-      if (driver.nonEmpty) {
-        driver.get.askSync[Boolean](DecommissionExecutor(executorId))
+  private def shutdownIfDone(): Unit = {
+    val numRunningTasks = executor.numRunningTasks
+    logInfo(s"Checking to see if we can shutdown; have ${numRunningTasks} running tasks.")
+    if (numRunningTasks == 0) {
+      if (env.conf.get(STORAGE_DECOMMISSION_ENABLED)) {
+        val allBlocksMigrated = env.blockManager.decommissionManager match {
+          case Some(m) => m.allBlocksMigrated
+          case None => false // We haven't started migrations yet.
+        }
+        if (allBlocksMigrated) {
+          logInfo("No running tasks, all blocks migrated, stopping.")
+          exitExecutor(0, "Finished decommissioning", notifyDriver = true)
Review comment:
`exitExecutor` asynchronously sends `RemoveExecutor` to the driver. Does that
actually make it to the driver before the JVM exits? There is also the question
of whether we should be using the same `Shutdown`/`StopExecutor` codepath to do
the stopping (although it does seem we want to tell the driver that the
executor is being removed either way).
Interestingly, the driver does respond with a `StopExecutor` and does trigger
the clean shutdown path in the executor, but I wonder whether that arrives too
late to matter. Perhaps we shouldn't be calling `System.exit` here?
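For illustration, here is an untested sketch of making the notification
synchronous instead. It reuses the existing `RemoveExecutor` message, and it
assumes `ExecutorDecommission` can be constructed with no arguments and that
the driver replies with a `Boolean` (both assumptions, not verified):

```scala
// Untested sketch: block until the driver has processed the removal before
// exiting, rather than relying on exitExecutor's fire-and-forget send.
if (driver.nonEmpty) {
  // askSync waits for the driver's reply to RemoveExecutor before returning.
  driver.get.askSync[Boolean](RemoveExecutor(executorId, ExecutorDecommission()))
}
System.exit(0)
```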
Also, as currently written, this `exitExecutor` could cause job failures: the
`TaskSchedulerImpl` will treat the `ExecutorLossReason` sent by the executor to
the driver as `exitCausedByApp`, and thus penalize the tasks that were running
here. I don't think we should penalize the running job for a planned executor
decommission. One workaround might be to respond to the driver with
`ExecutorDecommission` (which is currently unused elsewhere) and then handle it
specifically in the `TaskSchedulerImpl`'s determination of `exitCausedByApp`,
along the lines of the sketch below.
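Rough, untested sketch of that determination (I believe the `exitCausedByApp`
match currently lives in `TaskSetManager.executorLost`, so the exact spot may
differ from what I said above):

```scala
// Untested sketch: add a case for planned decommission so the tasks lost with
// the executor are not counted as app-caused failures.
val exitCausedByApp: Boolean = reason match {
  case exited: ExecutorExited => exited.exitCausedByApp
  case ExecutorKilled => false
  case _: ExecutorDecommission => false // planned removal; don't blame the job
  case _ => true
}
```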