Weizhong created SPARK-17929:

             Summary: Deadlock when AM restart send RemoveExecutor
                 Key: SPARK-17929
                 URL: https://issues.apache.org/jira/browse/SPARK-17929
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Weizhong
            Priority: Minor

The fix for SPARK-10582 added reset() to CoarseGrainedSchedulerBackend.scala:
  protected def reset(): Unit = synchronized {
    numPendingExecutors = 0

    // Remove all the lingering executors that should be removed but not yet. The reason might be
    // because (1) disconnected event is not yet received; (2) executors die 
    executorDataMap.toMap.foreach { case (eid, _) =>
        RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
However, removeExecutor also needs the lock
"CoarseGrainedSchedulerBackend.this.synchronized". Because reset() already
holds that lock when it sends RemoveExecutor, this causes a deadlock: the RPC
send fails, and reset fails.
    private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
      logDebug(s"Asked to remove executor $executorId with reason $reason")
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          val killed = CoarseGrainedSchedulerBackend.this.synchronized {
            addressToExecutorId -= executorInfo.executorAddress
            executorDataMap -= executorId
            executorsPendingLossReason -= executorId
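
The lock ordering described above can be reproduced outside Spark. The following is a minimal sketch, not Spark code and not necessarily the fix adopted for this issue: ToyBackend, ask, resetHoldingLock and resetOutsideLock are hypothetical names, and a single-threaded executor stands in for the driver endpoint. It shows a blocking ask timing out when the caller holds the backend lock, and succeeding when the caller only snapshots the executor ids under the lock and sends the requests after releasing it.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the scheduler backend; none of this is Spark's API.
class ToyBackend {
  // A single thread plays the role of the driver endpoint that handles RemoveExecutor.
  private val endpointPool = Executors.newSingleThreadExecutor()
  private implicit val endpoint: ExecutionContext =
    ExecutionContext.fromExecutor(endpointPool)

  private var executors = Set("1", "2")

  // Runs on the endpoint thread and needs the backend lock, like removeExecutor.
  private def removeExecutor(id: String): Boolean = this.synchronized {
    executors -= id
    true
  }

  // Blocking ask: hand the request to the endpoint thread and wait for the reply.
  private def ask(id: String, timeout: FiniteDuration): Boolean =
    try Await.result(Future(removeExecutor(id)), timeout)
    catch { case _: TimeoutException => false }

  // Problematic shape: the caller waits for the endpoint while holding the lock
  // that the endpoint needs, so the ask can only time out.
  def resetHoldingLock(timeout: FiniteDuration): Boolean = this.synchronized {
    executors.toSeq.forall(id => ask(id, timeout))
  }

  // One possible alternative: copy the ids under the lock, then ask without it.
  def resetOutsideLock(timeout: FiniteDuration): Boolean = {
    val snapshot = this.synchronized { executors.toSeq }
    snapshot.forall(id => ask(id, timeout))
  }

  def remaining: Set[String] = this.synchronized(executors)

  def shutdown(): Unit = endpointPool.shutdown()
}
```

In this sketch the timeout is what turns the deadlock into a visible failure; without it, resetHoldingLock would block indefinitely.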

