[ 
https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325986#comment-16325986
 ] 

Saisai Shao commented on SPARK-22958:
-------------------------------------

Please see this line:

https://github.com/apache/spark/blob/9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L572

> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22958
>                 URL: https://issues.apache.org/jira/browse/SPARK-22958
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Shaoquan Zhang
>            Priority: Major
>         Attachments: How new executor is registered.png
>
>
> We have encountered the following scenario. We run a very simple job in yarn 
> cluster mode. This job needs only one executor to complete. In the running, 
> this job was stuck forever.
> After checking the job log, we found an issue in the Spark. When executor 
> fails to register with driver, YarnAllocator is blind to know it. As a 
> result, the variable (numExecutorsRunning) maintained by YarnAllocator does 
> not reflect the truth. When this variable is used to allocate resources to 
> the running job, misunderstanding happens. As for our job, the 
> misunderstanding results in forever stuck.
> The more details are as follows. The following figure shows how executor is 
> allocated when the job starts to run. Now suppose only one executor is 
> needed. In the figure, step 1,2,3 show how the executor is allocated. After 
> the executor is allocated, it needs to register with the driver (step 4) and 
> the driver responses to it (step 5). After the 5 steps, the executor can be 
> used to run tasks.
> !How new executor is registered.png!
> In YarnAllocator, when step 3 is finished, it will increase the the variable 
> "numExecutorsRunning" by one  as shown in the following code.
> {code:java}
> def updateInternalState(): Unit = synchronized {
>         // increase the numExecutorsRunning 
>         numExecutorsRunning += 1
>         executorIdToContainer(executorId) = container
>         containerIdToExecutorId(container.getId) = executorId
>         val containerSet = 
> allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
>           new HashSet[ContainerId])
>         containerSet += containerId
>         allocatedContainerToHostMap.put(containerId, executorHostname)
>       }
>       if (numExecutorsRunning < targetNumExecutors) {
>         if (launchContainers) {
>           launcherPool.execute(new Runnable {
>             override def run(): Unit = {
>               try {
>                 new ExecutorRunnable(
>                   Some(container),
>                   conf,
>                   sparkConf,
>                   driverUrl,
>                   executorId,
>                   executorHostname,
>                   executorMemory,
>                   executorCores,
>                   appAttemptId.getApplicationId.toString,
>                   securityMgr,
>                   localResources
>                 ).run()
>                 // step 3 is finished
>                 updateInternalState()
>               } catch {
>                 case NonFatal(e) =>
>                   logError(s"Failed to launch executor $executorId on 
> container $containerId", e)
>                   // Assigned container should be released immediately to 
> avoid unnecessary resource
>                   // occupation.
>                   amClient.releaseAssignedContainer(containerId)
>               }
>             }
>           })
>         } else {
>           // For test only
>           updateInternalState()
>         }
>       } else {
>         logInfo(("Skip launching executorRunnable as runnning Excecutors 
> count: %d " +
>           "reached target Executors count: %d.").format(numExecutorsRunning, 
> targetNumExecutors))
>       }
> {code}
>    
> Imagine the step 3 successes, but the step 4 is failed due to some reason 
> (for example network fluctuation). The variable "numExecutorsRunning" is 
> equal to 1. But, the fact is no executor is running. So, The variable 
> "numExecutorsRunning" does not reflect the real number of running executors. 
> For YarnAllocator, because the variable is equal to 1, it does not allocate 
> any new executor even though no executor is actually running. If one job only 
> needs one executor to complete, it will stuck forever since no executor runs 
> its tasks. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to