vanzin commented on a change in pull request #24072: [SPARK-27112] : Create a
resource ordering between threads to resolve the deadlocks encountered …
URL: https://github.com/apache/spark/pull/24072#discussion_r265329112
##########
File path:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -631,29 +641,33 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       force: Boolean): Seq[String] = {
     logInfo(s"Requesting to kill executor(s) ${executorIds.mkString(", ")}")
-    val response = synchronized {
-      val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
-      unknownExecutors.foreach { id =>
-        logWarning(s"Executor to kill $id does not exist!")
-      }
+    // SPARK-27112: We need to ensure that there is ordering of lock acquisition
+    // between TaskSchedulerImpl and CoarseGrainedSchedulerBackend objects in order to fix
+    // the deadlock issue exposed in SPARK-27112
+    val response = scheduler.synchronized {
+      this.synchronized {
+        val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
+        unknownExecutors.foreach { id =>
+          logWarning(s"Executor to kill $id does not exist!")
+        }

-      // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
-      // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
-      val executorsToKill = knownExecutors
-        .filter { id => !executorsPendingToRemove.contains(id) }
-        .filter { id => force || !scheduler.isExecutorBusy(id) }
-      executorsToKill.foreach { id => executorsPendingToRemove(id) = !countFailures }
+        // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
+        // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
+        val executorsToKill = knownExecutors
+          .filter { id => !executorsPendingToRemove.contains(id) }
+          .filter { id => force || !scheduler.isExecutorBusy(id) }
 Review comment:
   In a similar vein to my previous comment, although I'm less sure about this
one.
   This seems to be the only interaction with the scheduler in this method, so
could this filtering be done first thing in the method, with the scheduler lock
held, and then the rest of the code would only need the
`CoarseGrainedSchedulerBackend` lock?
   It seems to me the behavior wouldn't change from the current state (where
the internal scheduler state can change while this method is running). And, as
in the other case, it's easier to reason about the code when only one lock is
held at a time.
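   To make the suggestion concrete, here is a minimal, self-contained sketch of that restructuring. `Scheduler` and `Backend` are stand-ins for `TaskSchedulerImpl` and `CoarseGrainedSchedulerBackend` (the real classes carry much more state, and details like `countFailures` are omitted); the point is only the lock discipline: the single scheduler interaction happens first, under the scheduler's lock, and the remaining bookkeeping needs only the backend's own lock, so the two locks are never held at the same time and no ordering between them has to be enforced.

```scala
import scala.collection.mutable

// Hypothetical stand-in for TaskSchedulerImpl; only the busy check matters here.
// isExecutorBusy takes the scheduler's intrinsic lock internally.
class Scheduler {
  private val busy = mutable.Set("exec-2")
  def isExecutorBusy(id: String): Boolean = synchronized { busy.contains(id) }
}

// Hypothetical stand-in for CoarseGrainedSchedulerBackend.
class Backend(scheduler: Scheduler) {
  private val executorDataMap = mutable.Set("exec-1", "exec-2")
  private val executorsPendingToRemove = mutable.Map.empty[String, Boolean]

  def killExecutors(executorIds: Seq[String], force: Boolean): Seq[String] = {
    // Step 1: the only scheduler interaction, done first. The scheduler lock
    // is acquired (inside isExecutorBusy) and released before step 2 begins.
    val notBusy = executorIds.filter { id => force || !scheduler.isExecutorBusy(id) }
    // Step 2: everything else touches only backend state, so only the
    // backend's own lock is needed; the two locks are never held together.
    synchronized {
      val executorsToKill = notBusy
        .filter(executorDataMap.contains)
        .filterNot(executorsPendingToRemove.contains)
      executorsToKill.foreach { id => executorsPendingToRemove(id) = true }
      executorsToKill
    }
  }
}
```

   As in the current code, the scheduler's busy state can still change between step 1 and step 2, but that race exists today as well, since the busy check is only a best-effort filter.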
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]