[
https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shaoquan Zhang updated SPARK-22958:
-----------------------------------
Attachment: How new executor is registered.png
> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
> Key: SPARK-22958
> URL: https://issues.apache.org/jira/browse/SPARK-22958
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.1.0
> Reporter: Shaoquan Zhang
> Attachments: How new executor is registered.png
>
>
> We encountered the following scenario. We ran a very simple job in yarn
> cluster mode. The job needs only one executor to complete. While running,
> the job got stuck forever.
> After checking the job log, we found an issue in Spark. When an executor
> fails to register with the driver, YarnAllocator is not aware of the
> failure. As a result, the variable (numExecutorsRunning) maintained by
> YarnAllocator does not reflect the actual number of running executors.
> When this variable is used to allocate resources to the running job, the
> allocation decision is wrong. For our job, the wrong decision leaves the
> job stuck forever.
> More details are as follows. The following figure shows how
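
The counter drift described in the quoted report can be sketched with a toy
model. This is not Spark's code: ToyAllocator, ToyDriver, simulate and their
methods are hypothetical names made up for illustration, and the sketch
assumes the allocator counts an executor as running as soon as it is launched,
with no feedback from driver registration. Only the counter name mirrors
numExecutorsRunning from the report.

```python
class ToyAllocator:
    """Hypothetical stand-in for a resource allocator (not Spark's YarnAllocator)."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        # Mirrors the counter named in the report (numExecutorsRunning).
        self.num_executors_running = 0

    def launch_executor(self):
        # Assumption: the executor is counted as running at launch time,
        # before it has registered with the driver.
        self.num_executors_running += 1

    def missing_executors(self):
        # The allocator requests more executors only if this is positive.
        return max(self.target_executors - self.num_executors_running, 0)


class ToyDriver:
    """Hypothetical driver that only tracks successfully registered executors."""

    def __init__(self):
        self.registered = 0

    def register(self, succeeds):
        if succeeds:
            self.registered += 1
        return succeeds


def simulate(registration_succeeds):
    """Run one allocation round for a job that needs exactly one executor."""
    allocator = ToyAllocator(target_executors=1)
    driver = ToyDriver()
    if allocator.missing_executors() > 0:
        allocator.launch_executor()
        # The allocator never learns the outcome of this call.
        driver.register(registration_succeeds)
    return {
        "counted_running": allocator.num_executors_running,
        "registered": driver.registered,
        "will_request_more": allocator.missing_executors() > 0,
    }
```

Calling simulate(False) models the failure case: the allocator still counts
one running executor, the driver has none registered, and no replacement is
requested, which matches the stuck state the report describes. Under the same
assumptions, simulate(True) models the normal case where both views agree.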