Hua Liu created SPARK-20564:
-------------------------------
Summary: Large number of executor failures when the executor count
exceeds 2000
Key: SPARK-20564
URL: https://issues.apache.org/jira/browse/SPARK-20564
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 2.1.0, 1.6.2
Reporter: Hua Liu
When we ran a Spark application with more than 2000 executors, we noticed that
a large number of executors could not connect to the driver and were
consequently marked as failed. In some cases, the number of failed executors
reached twice the requested executor count, causing the application to retry
and eventually fail.
This happens because YarnAllocator requests all missing containers every
spark.yarn.scheduler.heartbeat.interval-ms (3 seconds by default). For example,
YarnAllocator can ask for and receive 2000 containers in a single request and
then launch them all at once. These thousands of executors then try to retrieve
the Spark properties and register with the driver. However, the driver handles
executor registration, stop, removal, and Spark property retrieval in a single
thread, and it cannot process such a large number of RPCs within a short period
of time. As a result, some executors fail to retrieve the Spark properties
and/or register. They are marked as failed, which triggers executor removal,
further overloading the driver and causing even more executor failures.
This patch adds a new configuration,
spark.yarn.launchContainer.count.simultaneously, which caps the maximum number
of containers the driver can request and launch in each
spark.yarn.scheduler.heartbeat.interval-ms interval. As a result, the number of
executors grows steadily, executor failures are reduced, and applications reach
the desired number of executors faster.
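To illustrate the idea, here is a minimal sketch of the capping logic described above. The method name, class, and the cap value of 500 are assumptions for illustration only and do not reflect the actual patch to YarnAllocator:

```java
// Hypothetical sketch (not the actual Spark patch): per heartbeat, request at
// most a capped number of the missing containers instead of all of them.
public class LaunchCapSketch {
    // Assumed value for spark.yarn.launchContainer.count.simultaneously.
    static final int MAX_LAUNCH_PER_HEARTBEAT = 500;

    // How many containers to request in this heartbeat cycle, given the
    // desired total, the executors already running, and requests in flight.
    static int containersToRequest(int targetExecutors, int runningExecutors,
                                   int pendingRequests) {
        int missing = targetExecutors - runningExecutors - pendingRequests;
        // Never request a negative count; never exceed the per-heartbeat cap.
        return Math.min(Math.max(missing, 0), MAX_LAUNCH_PER_HEARTBEAT);
    }

    public static void main(String[] args) {
        // With 2000 executors missing, only 500 are requested this cycle;
        // the remainder is picked up on subsequent heartbeats.
        System.out.println(containersToRequest(2000, 0, 0));      // 500
        // Near the target, only the true shortfall is requested.
        System.out.println(containersToRequest(2000, 1800, 100)); // 100
    }
}
```

Spreading the requests across heartbeats keeps the driver's single RPC-handling thread from being flooded by thousands of simultaneous registrations.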
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]