[
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-20564:
------------------------------------
Assignee: (was: Apache Spark)
> a lot of executor failures when the executor number is more than 2000
> ---------------------------------------------------------------------
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 1.6.2, 2.1.0
> Reporter: Hua Liu
>
> When we used more than 2000 executors in a Spark application, we noticed that
> a large number of executors could not connect to the driver and, as a result,
> were marked as failed. In some cases, the number of failed executors reached
> twice the requested executor count, so applications retried and could
> eventually fail.
> This happens because YarnAllocator requests all missing containers every
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example,
> YarnAllocator can ask for and get over 2000 containers in one request, and
> then launch them almost simultaneously. These thousands of executors try to
> retrieve Spark properties and register with the driver within seconds.
> However, the driver handles executor registration, stop, removal, and Spark
> property retrieval in one thread, and it cannot handle such a large number
> of RPCs within a short period of time. As a result, some executors cannot
> retrieve Spark properties and/or register. These executors are then marked
> as failed, causing executor removal and aggravating the overload on the
> driver, which leads to more executor failures.
> This patch adds an extra configuration,
> spark.yarn.launchContainer.count.simultaneously, which caps the maximum
> number of containers that the driver can ask for in every
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of
> executors grows steadily, the number of executor failures is reduced, and
> applications can reach the desired number of executors faster.
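
The capping behavior described above can be illustrated with a small
simulation. This is only a sketch of the idea, not the patch's actual
YarnAllocator code: the function names, the cap value of 500, and the
simplification that every requested container is granted and registers
successfully are all assumptions made for illustration.

```python
from typing import List, Optional


def containers_to_request(missing: int, max_per_heartbeat: Optional[int]) -> int:
    """Return how many containers to ask for in a single heartbeat.

    `max_per_heartbeat` stands in for the cap proposed in the issue
    (spark.yarn.launchContainer.count.simultaneously); None means no cap,
    i.e. all missing containers are requested at once.
    """
    if max_per_heartbeat is None:
        return missing
    if max_per_heartbeat <= 0:
        raise ValueError("max_per_heartbeat must be positive")
    return min(missing, max_per_heartbeat)


def ramp_up(target: int, max_per_heartbeat: Optional[int]) -> List[int]:
    """Return the executor count after each heartbeat until `target` is reached.

    Hypothetical simplification: every requested container is granted and
    its executor starts successfully within the same heartbeat.
    """
    running = 0
    history = []
    while running < target:
        running += containers_to_request(target - running, max_per_heartbeat)
        history.append(running)
    return history


if __name__ == "__main__":
    # Without a cap, all 2000 executors launch in one heartbeat.
    print(ramp_up(2000, None))
    # With a (hypothetical) cap of 500, the count grows in steps.
    print(ramp_up(2000, 500))
```

Under these assumptions, the uncapped case launches every executor in a
single heartbeat, which is the burst of simultaneous registrations the issue
describes, while the capped case spreads the same launches over several
heartbeats.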