Github user GraceH commented on a diff in the pull request:
https://github.com/apache/spark/pull/7888#discussion_r44493155
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -509,6 +511,13 @@ private[spark] class ExecutorAllocationManager(
   private def onExecutorBusy(executorId: String): Unit = synchronized {
     logDebug(s"Clearing idle timer for $executorId because it is now running a task")
     removeTimes.remove(executorId)
+
+    // The executor may have been added to executorsPendingToRemove by mistake,
+    // because the async listener misjudged it as idle; see SPARK-9552.
+    if (executorsPendingToRemove.contains(executorId)) {
--- End diff --
Here is the problem:
1. You have executor-1, -2, and -3 to be killed (say the idle timeout triggers that).
2. According to our new criteria, only executor-1 is eligible to be killed; -2 and -3 are filtered out (`force = false`) and are never passed on by `killExecutors`. So only executor-1 is sent the kill command, and only its acknowledgement comes back.
3. We receive the acknowledgement (which really only covers executor-1), yet the current code path adds all three executor ids (-1, -2, -3) to `executorsPendingToRemove`, although only -1 is actually being killed (see the sketch after this list).
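
Below is a minimal, self-contained sketch of the flawed flow described in the steps above. It is not the actual Spark code: `SchedulerBackendSketch`, `busy`, and `doKillExecutors` are simplified stand-ins for the real backend members.

```scala
import scala.collection.mutable

// Stand-in for the scheduler backend; `busy` plays the role of
// "executors currently running tasks".
class SchedulerBackendSketch(busy: Set[String]) {
  val executorsPendingToRemove = mutable.Set[String]()

  // Stand-in for sending the kill command to the cluster manager
  // and getting back a single acknowledgement.
  private def doKillExecutors(ids: Seq[String]): Boolean = ids.nonEmpty

  def killExecutors(executorIds: Seq[String], force: Boolean): Boolean = {
    // With force = false, busy executors are filtered out,
    // e.g. only executor-1 survives out of (-1, -2, -3).
    val executorsToKill = executorIds.filter(id => force || !busy(id))
    val acknowledged = doKillExecutors(executorsToKill)
    if (acknowledged) {
      // Problem: every requested id is marked pending-to-remove, although
      // only `executorsToKill` were actually sent the kill command.
      executorsPendingToRemove ++= executorIds
    }
    acknowledged
  }
}

object Demo extends App {
  val backend = new SchedulerBackendSketch(busy = Set("executor-2", "executor-3"))
  backend.killExecutors(Seq("executor-1", "executor-2", "executor-3"), force = false)
  // Prints all three ids, even though only executor-1 was really killed.
  println(backend.executorsPendingToRemove)
}
```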
In dynamic allocation we can get away with that assumption, since it only kills a single executor at a time. But in the multi-executor case there is no way to tell the executor ids that were actually killed apart from the ones that were merely idle. Otherwise, we need to change the API so that it returns the list of executors that were really killed; a sketch of such a change follows.
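
A hedged sketch of that API change, reusing the stand-ins from the sketch above (again, these are not the real Spark signatures): `killExecutors` returns the ids that were actually sent the kill command instead of a bare Boolean, so the caller can record exactly those as pending removal.

```scala
def killExecutors(executorIds: Seq[String], force: Boolean): Seq[String] = {
  // Same filter as before: with force = false, busy executors are dropped.
  val executorsToKill = executorIds.filter(id => force || !busy(id))
  if (doKillExecutors(executorsToKill)) {
    // Record only the ids that were really sent the kill command,
    // and hand them back to the caller explicitly.
    executorsPendingToRemove ++= executorsToKill
    executorsToKill
  } else {
    Seq.empty
  }
}
```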