[ 
https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoquan Zhang updated SPARK-22958:
-----------------------------------
    Description: 
We have encountered the following scenario. We ran a very simple job in YARN 
cluster mode. The job needs only one executor to complete. During the run, 
the job got stuck forever.

After checking the job log, we found an issue in Spark. When an executor fails 
to register with the driver, YarnAllocator is not aware of it. As a result, the 
variable numExecutorsRunning maintained by YarnAllocator does not reflect the 
true number of running executors. When this variable is then used to allocate 
resources to the running job, the allocation is based on a wrong count. In our 
job, this miscount caused the job to hang forever.
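
The mismatch described above can be illustrated with a toy model. This is only 
a sketch, not Spark's code: ToyAllocator, ToyDriver and simulate are 
hypothetical names, and the sketch assumes the counter is incremented when a 
container is launched rather than when the executor registers with the driver.

```python
class ToyAllocator:
    """Toy stand-in for the allocator's bookkeeping (not Spark's YarnAllocator)."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        # Assumption: incremented at container launch, not at registration.
        self.num_executors_running = 0

    def containers_to_request(self):
        # More containers are requested only while the allocator believes
        # fewer executors are running than the target.
        return max(self.target_executors - self.num_executors_running, 0)

    def launch_executor(self):
        # Counted as running before the registration outcome is known.
        self.num_executors_running += 1


class ToyDriver:
    """Toy stand-in for the driver side, which tracks registered executors."""

    def __init__(self):
        self.registered_executors = 0

    def register(self, succeeds):
        if succeeds:
            self.registered_executors += 1
        return succeeds


def simulate(registration_succeeds):
    # A job that needs exactly one executor, as in the report.
    allocator = ToyAllocator(target_executors=1)
    driver = ToyDriver()
    for _ in range(allocator.containers_to_request()):
        allocator.launch_executor()
        driver.register(registration_succeeds)
    return {
        "counted_running": allocator.num_executors_running,
        "registered": driver.registered_executors,
        "further_requests": allocator.containers_to_request(),
    }
```

In this model, simulate(False) leaves the allocator counting one running 
executor while the driver has none registered, and no further container is 
requested, which matches the hang we observed.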

More details are as follows. The following figure shows how a new executor 
is registered:
!How new executor is registered.png!



> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22958
>                 URL: https://issues.apache.org/jira/browse/SPARK-22958
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Shaoquan Zhang
>         Attachments: How new executor is registered.png
>
>


