[ https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shaoquan Zhang updated SPARK-22958:
-----------------------------------
    Attachment: How new executor is registered.png

> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22958
>                 URL: https://issues.apache.org/jira/browse/SPARK-22958
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Shaoquan Zhang
>         Attachments: How new executor is registered.png
>
>
> We encountered the following scenario. We ran a very simple job in YARN 
> cluster mode. The job needs only one executor to complete. While running, 
> the job hung forever.
> After checking the job log, we found an issue in Spark. When an executor 
> fails to register with the driver, YarnAllocator is unaware of the failure. 
> As a result, the variable numExecutorsRunning maintained by YarnAllocator 
> does not reflect the true number of running executors. When this variable 
> is then used to allocate resources to the running job, the allocator 
> miscounts. In our job, this miscount caused the job to hang forever.
> More details follow. The following figure shows how 
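
The counter mismatch described above can be illustrated with a toy model. This is only a sketch, not Spark's YarnAllocator code: the class ToyAllocator, the function simulate, and the field names are hypothetical, and the model assumes the counter is incremented when a container is launched and that nothing reports a failed registration back to the allocator.

```python
class ToyAllocator:
    """Toy stand-in for the bookkeeping described in the report (not Spark code)."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        # Plays the role of numExecutorsRunning in the report.
        self.num_executors_running = 0

    def launch_container(self):
        # Assumption: the counter is bumped at container launch,
        # before the executor has registered with the driver.
        self.num_executors_running += 1

    def missing_executors(self):
        # More containers are requested only when this is positive.
        return max(0, self.target_executors - self.num_executors_running)


def simulate(registration_succeeds):
    """Return (executors the allocator still wants, executors the driver has)."""
    allocator = ToyAllocator(target_executors=1)
    registered_with_driver = 0

    allocator.launch_container()
    if registration_succeeds:
        registered_with_driver += 1
    # On failure, nothing tells the allocator, so its counter stays at 1.

    return allocator.missing_executors(), registered_with_driver
```

In this model, simulate(False) leaves the allocator wanting zero additional executors while the driver has none registered, which corresponds to the hang in the report: no replacement is requested and the job has nothing to run on.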



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
