GitHub user angolon opened a pull request:

    https://github.com/apache/spark/pull/14933

    [SPARK-16533][CORE] - backport driver deadlock fix to 2.0

    ## What changes were proposed in this pull request?
    Backport changes from #14710 and #14925 to 2.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/angolon/spark SPARK-16533-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14933
    
----
commit 45a3f220b5f1b08fbe4f8d390755041dd2738e67
Author: Angus Gerry <[email protected]>
Date:   2016-09-01T17:35:31Z

    [SPARK-16533][CORE] resolve deadlocking in driver when executors die
    
    This pull request reverts the changes made as a part of #14605, which
    simply side-steps the deadlock issue. Instead, I propose the following
    approach:
    * Use `scheduleWithFixedDelay` when calling
      `ExecutorAllocationManager.schedule` for scheduling executor requests.
      The intent of this is that if invocations are delayed beyond the
      default schedule interval on account of lock contention, then we avoid
      a situation where calls to `schedule` are made back-to-back,
      potentially releasing and then immediately reacquiring these locks,
      further exacerbating contention.
    * Replace a number of calls to `askWithRetry` with `ask` inside of
      message handling code in `CoarseGrainedSchedulerBackend` and its ilk.
      This allows us to queue messages with the relevant endpoints, release
      whatever locks we might be holding, and then block whilst awaiting the
      response. This change is made at the cost of being able to retry
      should sending the message fail, as retrying outside of the lock could
      easily cause race conditions if other conflicting messages have been
      sent whilst awaiting a response. I believe this to be the lesser of
      two evils, as in many cases these RPC calls are to process-local
      components, and so failures are more likely to be deterministic, and
      timeouts are more likely to be caused by lock contention.
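
To illustrate the first point, here is a rough model of when each run would start under the two `java.util.concurrent` scheduling policies if one run is held up (for example by lock contention). This is a simplified simulation and not Spark code: the class and method names are hypothetical, the initial delay is assumed to be zero, and times are in arbitrary milliseconds.

```java
import java.util.Arrays;

// Hypothetical model, not part of Spark: computes run start times only.
class ScheduleModel {
    // scheduleAtFixedRate: run k is due at k * period, but it cannot start
    // before the previous run has finished, so late runs fire back-to-back.
    static long[] fixedRateStarts(long period, long[] durations) {
        long[] starts = new long[durations.length];
        long previousEnd = 0;
        for (int k = 0; k < durations.length; k++) {
            starts[k] = Math.max(k * period, previousEnd);
            previousEnd = starts[k] + durations[k];
        }
        return starts;
    }

    // scheduleWithFixedDelay: each run starts a full delay after the
    // previous run ended, so a slow run never causes a burst of catch-up runs.
    static long[] fixedDelayStarts(long delay, long[] durations) {
        long[] starts = new long[durations.length];
        long previousEnd = 0;
        for (int k = 0; k < durations.length; k++) {
            starts[k] = (k == 0) ? 0 : previousEnd + delay;
            previousEnd = starts[k] + durations[k];
        }
        return starts;
    }

    public static void main(String[] args) {
        // The first run is delayed to 350 ms; the rest take 10 ms each.
        long[] durations = {350, 10, 10, 10};
        System.out.println(Arrays.toString(fixedRateStarts(100, durations)));  // [0, 350, 360, 370]
        System.out.println(Arrays.toString(fixedDelayStarts(100, durations))); // [0, 450, 560, 670]
    }
}
```

In this model the fixed-rate policy starts the three catch-up runs with no gap between them, which is the back-to-back pattern the commit message describes, while the fixed-delay policy keeps a 100 ms gap after every run.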
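
The "queue the message, release the lock, then block" pattern from the second point can be sketched as follows. This is a minimal Java illustration under stated assumptions, not the actual `CoarseGrainedSchedulerBackend` change: the `AskOutsideLock` class, its `ask` helper, and the single-threaded `endpoint` are hypothetical stand-ins for an RPC endpoint whose handler needs the same lock as the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Spark code.
class AskOutsideLock {
    private final Object lock = new Object();
    private int requestedTotal = 0;

    // Stand-in endpoint: one daemon thread that handles queued messages.
    private final ExecutorService endpoint = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "endpoint");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for `ask`: queue the message and return a future immediately.
    private CompletableFuture<Boolean> ask(int total) {
        return CompletableFuture.supplyAsync(() -> {
            synchronized (lock) { // the handler needs the same lock to reply
                return total == requestedTotal;
            }
        }, endpoint);
    }

    boolean requestExecutors(int additional) {
        CompletableFuture<Boolean> reply;
        synchronized (lock) {
            requestedTotal += additional;
            reply = ask(requestedTotal); // queued only; nothing blocks here
        }
        // The lock is released, so the handler can acquire it and reply.
        try {
            return reply.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            // No retry here, mirroring the trade-off described above.
            throw new IllegalStateException("ask failed or timed out", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new AskOutsideLock().requestExecutors(2)); // true
    }
}
```

If `reply.get` were moved inside the `synchronized` block, the handler thread could never acquire `lock`, and the call would only return by timing out. That is the kind of stall the change is meant to avoid.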
    
    Existing tests, and manual tests under yarn-client mode.
    
    Author: Angus Gerry <[email protected]>
    
    Closes #14710 from angolon/SPARK-16533.

commit de488ce0a0025d3c9736a1df6e45d90e265a84d4
Author: Marcelo Vanzin <[email protected]>
Date:   2016-09-01T21:02:58Z

    [SPARK-16533][HOTFIX] Fix compilation on Scala 2.10.
    
    No idea why it was failing (the needed import was there), but
    this makes things work.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #14925 from vanzin/SPARK-16533.

----

