GitHub user angolon opened a pull request:
https://github.com/apache/spark/pull/14710
[SPARK-16533][CORE]
## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which
simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling
`ExecutorAllocationManager.schedule` for scheduling executor requests. The
intent of this is that if invocations are delayed beyond the default schedule
interval on account of lock contention, then we avoid a situation where calls
to `schedule` are made back-to-back, potentially releasing and then immediately
reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message
handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us
to queue messages with the relevant endpoints, release whatever locks we might
be holding, and then block whilst awaiting the response. This change comes at
the cost of being able to retry should sending the message fail, as retrying
outside of the lock could easily cause race conditions if other conflicting
messages have been sent whilst awaiting a response. I believe this to be the
lesser of two evils, as in many cases these RPC calls are to process-local
components, and so failures are more likely to be deterministic, and timeouts
are more likely to be caused by lock contention.
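
To illustrate both points, here is a minimal, self-contained Java sketch. It is not the Spark code: `LockScopeSketch`, its fields, and its `ask` helper are hypothetical stand-ins. The only real API it relies on is `java.util.concurrent`. The request is queued while the lock is held, the caller blocks for the reply only after releasing the lock, and the periodic task is driven by `scheduleWithFixedDelay`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LockScopeSketch {
    private final Object lock = new Object();
    private int requested = 0;     // guarded by lock
    private int acknowledged = 0;  // guarded by lock

    // Hypothetical stand-in for a non-blocking RPC `ask`: it queues the message
    // and returns at once. The handler runs on another thread and needs `lock`.
    private CompletableFuture<Boolean> ask(int total) {
        return CompletableFuture.supplyAsync(() -> {
            synchronized (lock) {
                acknowledged = Math.max(acknowledged, total);
                return true;
            }
        });
    }

    // Queue the request under the lock, then wait for the reply after releasing it.
    // Waiting inside the synchronized block would stall the handler until the timeout.
    public boolean requestExecutors(int n) throws Exception {
        CompletableFuture<Boolean> reply;
        synchronized (lock) {
            requested += n;
            reply = ask(requested);
        }
        // No retry here: the guarded state may have changed since the lock was released.
        return reply.get(5, TimeUnit.SECONDS);
    }

    public int acknowledged() {
        synchronized (lock) {
            return acknowledged;
        }
    }

    public static void main(String[] args) throws Exception {
        LockScopeSketch backend = new LockScopeSketch();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch threeRuns = new CountDownLatch(3);

        // Fixed delay: the 100 ms interval starts when the previous run finishes,
        // so a run held up by lock contention is never followed by a back-to-back run.
        timer.scheduleWithFixedDelay(() -> {
            try {
                backend.requestExecutors(1);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                threeRuns.countDown();
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        threeRuns.await();
        timer.shutdown();
        System.out.println("acknowledged executors: " + backend.acknowledged());
    }
}
```

If the `reply.get(...)` call were moved inside the `synchronized` block, the handler could never take the lock and the call would simply time out, which is the kind of stall this change is meant to avoid.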
## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/angolon/spark SPARK-16533
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14710.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14710
----
commit cef69bf470199c63b6638933756b1d057dc890d1
Author: Angus Gerry <[email protected]>
Date: 2016-08-19T01:52:58Z
Revert "[SPARK-17022][YARN] Handle potential deadlock in driver handling
messages"
This reverts commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e.
commit 4970b3b0bcd834bbe5d5473a3065f04a48b12643
Author: Angus Gerry <[email protected]>
Date: 2016-08-09T04:45:29Z
[SPARK-16533][CORE] Use scheduleWithFixedDelay when calling
ExecutorAllocatorManager.schedule to ease contention on locks.
commit 920274a3ed0b8278d38d721587a24c9441fa5ff3
Author: Angus Gerry <[email protected]>
Date: 2016-08-04T06:27:56Z
[SPARK-16533][CORE] Replace many calls to askWithRetry to plain old ask.
----