Shaoquan Zhang created SPARK-22958:
--------------------------------------
Summary: Spark is stuck when the only one executor fails to
register with driver
Key: SPARK-22958
URL: https://issues.apache.org/jira/browse/SPARK-22958
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 2.1.0
Reporter: Shaoquan Zhang
We have encountered the following scenario. We ran a very simple job in YARN
cluster mode. The job needs only one executor to complete. While running,
the job got stuck forever.
After checking the job log, we found an issue in Spark. When an executor fails
to register with the driver, YarnAllocator is not aware of the failure. As a
result, the variable numExecutorsRunning maintained by YarnAllocator no longer
reflects the actual number of running executors. When this variable is then
used to decide how many resources to allocate to the running job, the
allocator reaches the wrong conclusion. In our case, this left the job stuck
forever.
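The miscount described above can be sketched with a toy model. This is not
Spark's code: ToyAllocator, launch_executor, missing_executors, and
registered_with_driver are hypothetical names, and the sketch assumes the
counter is incremented when an executor is launched and that a failed
registration sends no signal back to the allocator.

```python
class ToyAllocator:
    """Toy model of an allocator that counts launched executors as running."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        # Plays the role of numExecutorsRunning in the report.
        self.num_executors_running = 0
        # Ground truth that this toy allocator never consults.
        self.registered_with_driver = 0

    def missing_executors(self):
        # How many more executors the allocator believes it still has to request.
        return self.target_executors - self.num_executors_running

    def launch_executor(self, registers_ok):
        # Assumption: the counter is bumped at launch time.
        self.num_executors_running += 1
        if registers_ok:
            self.registered_with_driver += 1
        # Assumption: a failed registration is never reported back,
        # so the counter is not decremented here.


def simulate():
    # A job that needs exactly one executor, whose registration fails.
    allocator = ToyAllocator(target_executors=1)
    for _ in range(allocator.missing_executors()):
        allocator.launch_executor(registers_ok=False)
    return allocator.missing_executors(), allocator.registered_with_driver
```

In this model, simulate() ends with the allocator believing no executors are
missing while none has registered with the driver, so no replacement is ever
requested and the job cannot make progress.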
More details are as follows. The following figure shows how