GitHub user scwf opened a pull request:

    https://github.com/apache/spark/pull/15481

    [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-17929
    
    Currently, `CoarseGrainedSchedulerBackend.reset` acquires the lock:
    ```
      protected def reset(): Unit = synchronized {
        numPendingExecutors = 0
        executorsPendingToRemove.clear()

        // Remove all the lingering executors that should be removed but not yet. The reason might be
        // because (1) disconnected event is not yet received; (2) executors die silently.
        executorDataMap.toMap.foreach { case (eid, _) =>
          driverEndpoint.askWithRetry[Boolean](
            RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
        }
      }
    ```
    but `removeExecutor`, which handles the `RemoveExecutor` message, also needs that lock via `CoarseGrainedSchedulerBackend.this.synchronized`. Since `reset` keeps the lock while `askWithRetry` blocks waiting for the reply, the two end up waiting on each other, which causes a deadlock.
    
    ```
      private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
        logDebug(s"Asked to remove executor $executorId with reason $reason")
        executorDataMap.get(executorId) match {
          case Some(executorInfo) =>
            // This must be synchronized because variables mutated
            // in this block are read when requesting executors
            val killed = CoarseGrainedSchedulerBackend.this.synchronized {
              addressToExecutorId -= executorInfo.executorAddress
              executorDataMap -= executorId
              executorsPendingLossReason -= executorId
              executorsPendingToRemove.remove(executorId).getOrElse(false)
            }
         ...
    ```
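    
    The commit in this branch is titled "use send", which points at the likely direction of the fix: stop blocking on a reply while the lock is held, for example by replacing the blocking `askWithRetry` with a fire-and-forget `send`. A minimal sketch of that idea (not necessarily the exact patch; it assumes `RemoveExecutor` can be handled on the one-way `receive` path):
    
    ```
      protected def reset(): Unit = synchronized {
        numPendingExecutors = 0
        executorsPendingToRemove.clear()

        // Fire-and-forget: reset() no longer blocks on a reply while holding the
        // CoarseGrainedSchedulerBackend lock, so removeExecutor can later acquire
        // the lock without the two waiting on each other.
        executorDataMap.toMap.foreach { case (eid, _) =>
          driverEndpoint.send(
            RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
        }
      }
    ```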
    
    ## How was this patch tested?
    
    Manual test.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KirinKing/spark spark-17929

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15481.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15481
    
----
commit 3681fae6b5364a5cf55700e1510473d8d9b77cd3
Author: w00228970 <wangf...@huawei.com>
Date:   2016-10-14T09:24:30Z

    use send

----

