GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/14544

    [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable

    ## What changes were proposed in this pull request?
    
    This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, 
to let users tune a little-known behavior of the standalone master: the master 
kills Spark applications that have experienced too many back-to-back executor 
failures. The limit is currently a hardcoded constant (10); this patch replaces 
it with a cluster-wide configuration.
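    
    To make this concrete, here is a minimal sketch of the Master-side lookup, assuming the limit is read from the Master's own `SparkConf`. The property name and the old default of 10 come from this description; the surrounding code is illustrative, not the actual patch:
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Illustrative sketch only, not the actual Master code.
    // Before: the retry limit was the hardcoded constant ApplicationState.MAX_NUM_RETRY (10).
    // After: the limit is read from configuration, defaulting to the old constant so that
    // existing clusters keep the same behavior unless they opt in to a different value.
    val conf = new SparkConf()
    val maxExecutorRetries: Int = conf.getInt("spark.deploy.maxExecutorRetries", 10)
    ```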
    
    **Background:** This application-killing was added in 
6b5980da796e0204a7735a31fb454f312bc9daac (from September 2012) and I believe 
that it was designed to prevent a faulty application whose executors could 
never launch from DoS'ing the Spark cluster via an infinite series of executor 
launch attempts. In a subsequent patch (#1360), this feature was refined to 
prevent applications which have running executors from being killed by this 
code path.
    
    **Motivation for making this configurable:** Previously, if a Spark 
Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` 
back-to-back executor failures and was left with no running executors, then the 
Spark master would kill that application. This behavior is problematic in 
environments 
where the Spark executors run on unstable infrastructure and can all 
simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 
instance while all workers run on ephemeral spot instances then it's possible 
for all executors to die at the same time while the driver stays alive. In this 
case, it may be desirable to keep the Spark application alive so that it can 
recover once new workers and executors are available. In order to accommodate 
this use-case, this patch modifies the Master to never kill faulty applications 
if `spark.deploy.maxExecutorRetries` is negative.
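    
    The following hedged sketch shows the decision rule implied by the description above; the helper name and its parameters are hypothetical, not the Master's actual code:
    
    ```scala
    // Hypothetical helper illustrating the rules described above:
    //  - a negative limit disables application-killing entirely,
    //  - applications that still have running executors are never killed (the #1360 refinement),
    //  - otherwise the application is killed once it accumulates more back-to-back
    //    executor failures than the configured limit while having no running executors.
    def shouldKillApplication(
        maxExecutorRetries: Int,       // value of spark.deploy.maxExecutorRetries (default 10)
        backToBackFailures: Int,       // consecutive executor failures seen by the Master
        hasRunningExecutors: Boolean): Boolean = {
      maxExecutorRetries >= 0 &&                    // negative => never kill
        backToBackFailures > maxExecutorRetries &&  // "more than" the configured limit
        !hasRunningExecutors                        // apps with live executors are spared
    }
    ```
    
    For example, with `spark.deploy.maxExecutorRetries` set to a negative value such as `-1`, the predicate above is always false, so an application that loses all of its executors at once (the spot-instance scenario) stays alive and can recover when new workers come up.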
    
    I'd like to merge this patch into master, branch-2.0, and branch-1.6.
    
    ## How was this patch tested?
    
    I tested this manually using `spark-shell` and `local-cluster` mode. This 
is a tricky feature to unit test and historically this code has not changed 
very often, so I'd prefer to skip the additional effort of adding a testing 
framework and would rather rely on manual tests and review for now.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark add-setting-for-max-executor-failures

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14544
    
----
commit ed12bb7846e0de446da68a80151debf721fc18d2
Author: Josh Rosen <[email protected]>
Date:   2016-08-08T18:29:35Z

    Add configuration.

commit 3d539ba11c60aead9899526849d90c6272fbcfab
Author: Josh Rosen <[email protected]>
Date:   2016-08-08T18:40:59Z

    Fix docs.

commit d9bab26f7c787cdf37e9243de5ce58a65b5f10e5
Author: Josh Rosen <[email protected]>
Date:   2016-08-08T19:11:25Z

    Add comment warning of lack of test coverage.

----


