[
https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325986#comment-16325986
]
Saisai Shao commented on SPARK-22958:
-------------------------------------
Please see this line:
https://github.com/apache/spark/blob/9a96bfc8bf021cb4b6c62fac6ce1bcf87affcd43/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L572
> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
> Key: SPARK-22958
> URL: https://issues.apache.org/jira/browse/SPARK-22958
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.1.0
> Reporter: Shaoquan Zhang
> Priority: Major
> Attachments: How new executor is registered.png
>
>
> We have encountered the following scenario. We run a very simple job in yarn
> cluster mode. This job needs only one executor to complete. In the running,
> this job was stuck forever.
> After checking the job log, we found an issue in the Spark. When executor
> fails to register with driver, YarnAllocator is blind to know it. As a
> result, the variable (numExecutorsRunning) maintained by YarnAllocator does
> not reflect the truth. When this variable is used to allocate resources to
> the running job, misunderstanding happens. As for our job, the
> misunderstanding results in forever stuck.
> The more details are as follows. The following figure shows how executor is
> allocated when the job starts to run. Now suppose only one executor is
> needed. In the figure, step 1,2,3 show how the executor is allocated. After
> the executor is allocated, it needs to register with the driver (step 4) and
> the driver responses to it (step 5). After the 5 steps, the executor can be
> used to run tasks.
> !How new executor is registered.png!
> In YarnAllocator, when step 3 is finished, it will increase the the variable
> "numExecutorsRunning" by one as shown in the following code.
> {code:java}
> def updateInternalState(): Unit = synchronized {
> // increase the numExecutorsRunning
> numExecutorsRunning += 1
> executorIdToContainer(executorId) = container
> containerIdToExecutorId(container.getId) = executorId
> val containerSet =
> allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
> new HashSet[ContainerId])
> containerSet += containerId
> allocatedContainerToHostMap.put(containerId, executorHostname)
> }
> if (numExecutorsRunning < targetNumExecutors) {
> if (launchContainers) {
> launcherPool.execute(new Runnable {
> override def run(): Unit = {
> try {
> new ExecutorRunnable(
> Some(container),
> conf,
> sparkConf,
> driverUrl,
> executorId,
> executorHostname,
> executorMemory,
> executorCores,
> appAttemptId.getApplicationId.toString,
> securityMgr,
> localResources
> ).run()
> // step 3 is finished
> updateInternalState()
> } catch {
> case NonFatal(e) =>
> logError(s"Failed to launch executor $executorId on
> container $containerId", e)
> // Assigned container should be released immediately to
> avoid unnecessary resource
> // occupation.
> amClient.releaseAssignedContainer(containerId)
> }
> }
> })
> } else {
> // For test only
> updateInternalState()
> }
> } else {
> logInfo(("Skip launching executorRunnable as runnning Excecutors
> count: %d " +
> "reached target Executors count: %d.").format(numExecutorsRunning,
> targetNumExecutors))
> }
> {code}
>
> Imagine the step 3 successes, but the step 4 is failed due to some reason
> (for example network fluctuation). The variable "numExecutorsRunning" is
> equal to 1. But, the fact is no executor is running. So, The variable
> "numExecutorsRunning" does not reflect the real number of running executors.
> For YarnAllocator, because the variable is equal to 1, it does not allocate
> any new executor even though no executor is actually running. If one job only
> needs one executor to complete, it will stuck forever since no executor runs
> its tasks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]