[
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Wang updated SPARK-17022:
-----------------------------
Well, `YarnSchedulerEndpoint` is a **ThreadSafeRpcEndpoint**, which can only
handle one message at a time.
[quote]
Thread-safety means processing of one message happens before processing of the
next message by
the same [[ThreadSafeRpcEndpoint]].
[/quote]
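To make the ordering concrete, here is a minimal sketch of the cycle described in the quoted report below, written with plain Python threading. All names (`backend_lock`, `inbox`, the message strings) are illustrative stand-ins for the Spark internals, not the actual code; real timeouts are replaced with short ones so the example terminates.

```python
import queue
import threading
import time

backend_lock = threading.Lock()   # stands in for the CoarseGrainedSchedulerBackend monitor
inbox = queue.Queue()             # stands in for the single-threaded ThreadSafeRpcEndpoint
reply_ready = threading.Event()   # reply the blocked RequestExecutors caller is waiting on
events = []

def endpoint_loop():
    # ThreadSafeRpcEndpoint semantics: strictly one message at a time, in arrival order.
    while not inbox.empty():
        msg = inbox.get()
        if msg == "RemoveExecutor":
            # The real handler forwards to driverEndpoint, whose handler needs the
            # backend lock -- but the t1 caller still holds it, so this blocks.
            if backend_lock.acquire(timeout=0.3):
                backend_lock.release()
            else:
                events.append("RemoveExecutor blocked on backend lock")
                return  # endpoint is stuck; RequestExecutors is never handled
        elif msg == "RequestExecutors":
            reply_ready.set()

def request_executors_caller():
    # t1: take the backend lock, send RequestExecutors, wait synchronously for a reply.
    with backend_lock:
        inbox.put("RequestExecutors")          # t3: queued behind RemoveExecutor
        if not reply_ready.wait(timeout=1.0):
            events.append("RequestExecutors timed out")

inbox.put("RemoveExecutor")                    # t2: reaches the endpoint first
caller = threading.Thread(target=request_executors_caller)
caller.start()
time.sleep(0.1)                                # let the caller take the lock first
endpoint = threading.Thread(target=endpoint_loop)
endpoint.start()
caller.join()
endpoint.join()
print(events)
```

With real (much longer) RPC timeouts, neither side gives up quickly, which is the driver stall described below.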
> Potential deadlock in driver handling message
> ---------------------------------------------
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
> Reporter: Tao Wang
> Assignee: Tao Wang
> Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one
> of three methods: CoarseGrainedSchedulerBackend.killExecutors,
> CoarseGrainedSchedulerBackend.requestTotalExecutors or
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the lock
> on `CoarseGrainedSchedulerBackend`.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors
> message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint` and
> the message is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint will therefore handle RemoveExecutor first, then the
> RequestExecutors message.
> When handling RemoveExecutor, it would send the same message to
> `driverEndpoint` and wait for reply.
> In `driverEndpoint`, handling that message requires the
> `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> So it causes a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling
> any messages until both messages time out.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]