[jira] [Created] (SPARK-17929) Deadlock when AM restart send RemoveExecutor

Weizhong (JIRA) Thu, 13 Oct 2016 19:57:27 -0700

Weizhong created SPARK-17929:
--------------------------------

             Summary: Deadlock when AM restart send RemoveExecutor
                 Key: SPARK-17929
                 URL: https://issues.apache.org/jira/browse/SPARK-17929
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Weizhong
            Priority: Minor



The fix for SPARK-10582 added reset() to CoarseGrainedSchedulerBackend.scala:
{code}
  protected def reset(): Unit = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()

    // Remove all the lingering executors that should be removed but not yet. The reason might be
    // because (1) disconnected event is not yet received; (2) executors die silently.
    executorDataMap.toMap.foreach { case (eid, _) =>
      driverEndpoint.askWithRetry[Boolean](
        RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
    }
  }
{code}
However, reset() holds the backend lock while it waits for each RemoveExecutor
reply, and removeExecutor needs the same lock
("CoarseGrainedSchedulerBackend.this.synchronized"). This causes a deadlock:
the RPC fails, and reset fails with it.
{code}
    private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
      logDebug(s"Asked to remove executor $executorId with reason $reason")
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          val killed = CoarseGrainedSchedulerBackend.this.synchronized {
            addressToExecutorId -= executorInfo.executorAddress
            executorDataMap -= executorId
            executorsPendingLossReason -= executorId
            executorsPendingToRemove.remove(executorId).getOrElse(false)
          }
     ...
{code}
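
The lock interaction described above can be reproduced in a small standalone model. This is only a sketch and does not use Spark's classes: ToyBackend, ask, resetHoldingLock and resetSnapshotFirst are hypothetical names, a single-threaded executor stands in for the driver endpoint, and a timed Future.get stands in for the blocking ask. The second reset variant shows one possible way to avoid the problem (copy the executor ids under the lock, send the requests after releasing it); it is not claimed to be the change that will be made in Spark.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}
import scala.collection.mutable

// Toy model only: ToyBackend and its methods are hypothetical, not Spark classes.
class ToyBackend {
  private val executorDataMap = mutable.HashMap("1" -> "host-a", "2" -> "host-b")

  // One thread plays the role of the driver endpoint's message loop.
  private val endpoint = Executors.newSingleThreadExecutor()

  // Handler side: takes the backend lock, as removeExecutor does in the report.
  private def removeExecutor(id: String): Boolean = synchronized {
    executorDataMap.remove(id).isDefined
  }

  // Blocking ask: true if the handler acknowledged within timeoutMs.
  private def ask(id: String, timeoutMs: Long): Boolean = {
    val reply = endpoint.submit(new Callable[Boolean] {
      override def call(): Boolean = removeExecutor(id)
    })
    try reply.get(timeoutMs, TimeUnit.MILLISECONDS)
    catch { case _: TimeoutException => false }
  }

  // Pattern from the report: ask while holding the lock the handler needs.
  def resetHoldingLock(timeoutMs: Long): Int = synchronized {
    executorDataMap.keys.toList.count(id => ask(id, timeoutMs))
  }

  // Possible alternative: snapshot the ids under the lock, ask after releasing it.
  def resetSnapshotFirst(timeoutMs: Long): Int = {
    val ids = synchronized { executorDataMap.keys.toList }
    ids.count(id => ask(id, timeoutMs))
  }

  def shutdown(): Unit = endpoint.shutdown()
}

object DeadlockSketch {
  def main(args: Array[String]): Unit = {
    val stuck = new ToyBackend
    println(s"acks while holding lock: ${stuck.resetHoldingLock(200L)}")
    stuck.shutdown()

    val ok = new ToyBackend
    println(s"acks after snapshot: ${ok.resetSnapshotFirst(2000L)}")
    ok.shutdown()
  }
}
```

In this model the first variant is expected to collect no acknowledgements, because the handler thread cannot enter the lock until reset returns and every ask times out first. The second variant should collect one acknowledgement per executor.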



