GitHub user markhamstra opened a pull request:

    https://github.com/apache/spark/pull/1360

    SPARK-2425 Don't kill a still-running Application because of some 
misbehaving Executors 

    Introduces a LOADING -> RUNNING ApplicationState transition and prevents 
Master from removing an Application with RUNNING Executors.
    
    Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor 
exits over the entire lifetime of the Application, allow that many since the 
most recent time any Executor successfully began running the Application; 
2) Don't remove the Application while Master still thinks that there are 
RUNNING Executors.
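
A minimal sketch of the two rules above, written in Python purely for illustration. It is not the Scala code in this PR: `AppInfo`, `executor_changed`, and the value chosen for `MAX_NUM_RETRY` are hypothetical stand-ins for Master's bookkeeping.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict

# Illustrative value only; the real limit is whatever Spark defines.
MAX_NUM_RETRY = 3


class ExecutorState(Enum):
    LOADING = "LOADING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


@dataclass
class AppInfo:
    """Hypothetical stand-in for Master's per-Application accounting."""
    executors: Dict[str, ExecutorState] = field(default_factory=dict)
    retry_count: int = 0
    removed: bool = False

    def has_running_executors(self) -> bool:
        return any(s is ExecutorState.RUNNING for s in self.executors.values())

    def executor_changed(self, exec_id: str, state: ExecutorState) -> None:
        self.executors[exec_id] = state
        if state is ExecutorState.RUNNING:
            # Change 1: an Executor that reaches RUNNING resets the retry
            # budget, so the limit applies since the last successful start.
            self.retry_count = 0
        elif state is ExecutorState.FAILED:
            self.retry_count += 1
            # Change 2: never remove the Application while some Executor
            # is still believed to be RUNNING.
            if self.retry_count >= MAX_NUM_RETRY and not self.has_running_executors():
                self.removed = True


app = AppInfo()
app.executor_changed("good", ExecutorState.RUNNING)
for i in range(5):
    # Misbehaving Executors fail repeatedly while "good" keeps running.
    app.executor_changed(f"bad-{i}", ExecutorState.FAILED)
print(app.removed)
```

In this sketch the Application survives the five failures because one Executor is still counted as RUNNING. It is removed only once the failure budget is exhausted and no Executor remains in the RUNNING state.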
    
    This should be fine as long as the ApplicationInfo doesn't believe any 
Executors are still RUNNING indefinitely when they are not.  I think that any 
Executors that have stopped running will eventually stop being counted as 
RUNNING in Master's accounting, but another set of eyes should confirm that.  
This PR also doesn't try to detect which nodes have gone rogue or to kill off 
bad Workers, so repeatedly failing Executors will continue to fail and fill up 
log files with failure reports as long as the Application keeps running.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markhamstra/spark SPARK-2425

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1360
    
----
commit 5b85534d376d682b7e1f97f98acd532a305349f8
Author: Mark Hamstra <markhams...@gmail.com>
Date:   2014-07-09T23:02:43Z

    SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition
    and prevent Master from removing Application with RUNNING Executors

----

