[ https://issues.apache.org/jira/browse/SPARK-22958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shaoquan Zhang updated SPARK-22958:
-----------------------------------
    Attachment: How new executor is registered.png

> Spark is stuck when the only one executor fails to register with driver
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22958
>                 URL: https://issues.apache.org/jira/browse/SPARK-22958
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Shaoquan Zhang
>         Attachments: How new executor is registered.png
>
>
> We encountered the following scenario. We ran a very simple job in YARN
> cluster mode. The job needs only one executor to complete. While running,
> the job hung indefinitely.
> After checking the job log, we found an issue in Spark. When an executor
> fails to register with the driver, YarnAllocator is not aware of it. As a
> result, the variable numExecutorsRunning maintained by YarnAllocator does
> not reflect the actual number of running executors. When this variable is
> used to allocate resources for the running job, YarnAllocator acts on an
> incorrect count. In our case, this caused the job to hang forever.
> More details follow. The following figure shows how