vanzin commented on a change in pull request #24072: [SPARK-27112] : Create a
resource ordering between threads to resolve the deadlocks encountered …
URL: https://github.com/apache/spark/pull/24072#discussion_r265325526
##########
File path:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -258,15 +258,20 @@ class CoarseGrainedSchedulerBackend(scheduler:
TaskSchedulerImpl, val rpcEnv: Rp
// Make fake resource offers on all executors
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
- val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
- // Filter out executors under killing
- val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
- val workOffers = activeExecutors.map {
- case (id, executorData) =>
- new WorkerOffer(id, executorData.executorHost,
executorData.freeCores,
- Some(executorData.executorAddress.hostPort))
- }.toIndexedSeq
- scheduler.resourceOffers(workOffers)
+ // SPARK-27112: We need to ensure that there is ordering of lock
acquisition
+ // between TaskSchedulerImpl and CoarseGrainedSchedulerBackend objects
in order to fix
+ // the deadlock issue exposed in SPARK-27112
+ val taskDescs = scheduler.synchronized {
Review comment:
I took a quick look at the code that calls this, and I'm wondering if
holding the two locks here is really needed.
For context, all this code is inside the RPC endpoint handler. This is a
`ThreadSafeRpcEndpoint` so there's only one message being processed at a time,
meaning that you won't have multiple threads calling `makeOffers` concurrently.
So it seems to me that it would be possible to:
- with the `CoarseGrainedSchedulerBackend.this` lock held, calculate the
works offers.
```
val workOffers = CoarseGrainedSchedulerBackend.this.synchronized {
...
}
```
With the scheduler lock held, make the offers:
```
val taskDesc = scheduler.synchronized {
scheduler.resourceOffers(workOffers)
}
```
And as far as I understand that should work and also be easier to
understand, right?
I also noticed that later this code calls `launchTasks`, and that method
accesses and modifies data in `executorDataMap` without the
`CoarseGrainedSchedulerBackend.this` lock, which is very sketchy.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]