GitHub user jsoltren opened a pull request:

    https://github.com/apache/spark/pull/16650

    [SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted

    ## What changes were proposed in this pull request?
    
    In SPARK-8425, we introduced a mechanism for blacklisting executors and 
nodes (hosts). After a certain number of failures, these resources would be 
"blacklisted" and no further work would be assigned to them for some period of 
time.
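
The blacklisting behaviour described above can be sketched as follows. This is a minimal illustration, not the BlacklistTracker code from Spark: the class name `BlacklistSketch`, the parameters `max_failures` and `timeout_s`, and the explicit `now` argument are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BlacklistSketch:
    """Hypothetical tracker: blacklist a resource after repeated failures."""
    max_failures: int = 2        # assumed threshold, not a Spark default
    timeout_s: float = 3600.0    # assumed blacklist duration in seconds
    failures: Dict[str, int] = field(default_factory=dict)
    expiry: Dict[str, float] = field(default_factory=dict)

    def record_failure(self, resource: str, now: float) -> None:
        # Count the failure; once the threshold is reached, blacklist
        # the resource until now + timeout_s.
        count = self.failures.get(resource, 0) + 1
        self.failures[resource] = count
        if count >= self.max_failures:
            self.expiry[resource] = now + self.timeout_s

    def is_blacklisted(self, resource: str, now: float) -> bool:
        expires_at = self.expiry.get(resource)
        if expires_at is None:
            return False
        if now >= expires_at:
            # The blacklist period has elapsed: forget the resource's history.
            del self.expiry[resource]
            self.failures.pop(resource, None)
            return False
        return True
```

A resource (executor or host) receives no work while `is_blacklisted` returns `True`, and becomes eligible again once the timeout has passed.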
    
    In some scenarios, it is better to fail fast and simply kill these 
unreliable resources. This change proposes to do so by having the 
BlacklistTracker kill unreliable resources when they would otherwise be 
"blacklisted".
    
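The idea of killing a resource at the point where it would otherwise be blacklisted can be sketched as below. This is an illustration under stated assumptions rather than the patch itself: `make_failure_handler`, `max_failures`, `kill_enabled`, and the `kill` callback are hypothetical names, and the real change wires the behaviour through Spark configuration.

```python
from typing import Callable, Dict, Set


def make_failure_handler(
    max_failures: int,
    kill_enabled: bool,
    kill: Callable[[str], None],
) -> Callable[[str], bool]:
    """Return a handler that blacklists, and optionally kills, an executor."""
    failures: Dict[str, int] = {}
    blacklisted: Set[str] = set()

    def on_failure(executor_id: str) -> bool:
        failures[executor_id] = failures.get(executor_id, 0) + 1
        newly_blacklisted = (
            failures[executor_id] >= max_failures
            and executor_id not in blacklisted
        )
        if newly_blacklisted:
            blacklisted.add(executor_id)
            if kill_enabled:
                # Fail fast: request a kill instead of only withholding work.
                kill(executor_id)
        return executor_id in blacklisted

    return on_failure
```

With `kill_enabled=False` the handler only blacklists, which corresponds to the behaviour before this change; the kill callback fires at most once per executor.
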
    To be thread safe, this code relies on the 
CoarseGrainedSchedulerBackend sending a message to the driver backend to 
perform the actual killing. This also helps to prevent a race in which work 
could begin on a resource (executor or node) between the time the resource is 
marked for killing and the time at which it is finally killed.
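
A rough sketch of why routing the kill through a message helps: if a single loop handles both kill requests and task launches in arrival order, no launch can slip in between the mark and the kill. The class `DriverEndpointSketch` and the message names `kill_host` and `launch_task` are hypothetical and do not mirror Spark's RPC messages.

```python
import queue
from typing import Dict, Iterable, List, Set, Tuple


class DriverEndpointSketch:
    """Hypothetical single-consumer endpoint that serialises all messages."""

    def __init__(self, executors_by_host: Dict[str, Iterable[str]]) -> None:
        self.executors_by_host = {h: set(e) for h, e in executors_by_host.items()}
        self.dead: Set[str] = set()
        self.inbox: "queue.Queue[Tuple[str, str]]" = queue.Queue()

    def send(self, message: Tuple[str, str]) -> None:
        # Any thread may enqueue; only process_all touches the state.
        self.inbox.put(message)

    def process_all(self) -> List[Tuple[str, bool]]:
        """Drain the inbox in order; report whether each launch was accepted."""
        results = []
        while not self.inbox.empty():
            kind, payload = self.inbox.get_nowait()
            if kind == "kill_host":
                # Marking and killing happen in one step of the loop.
                self.dead |= self.executors_by_host.pop(payload, set())
            elif kind == "launch_task":
                results.append((payload, payload not in self.dead))
        return results
```

Because state is only mutated inside `process_all`, a launch queued after a kill request for the same host is always rejected, while launches queued earlier are unaffected.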
    
    ## How was this patch tested?
    
    ./dev/run-tests
    Ran 
https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh 
and checked the logs to confirm that executors and nodes were being killed.
    
    Testing can likely be improved here; suggestions welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsoltren/spark SPARK-16554-submit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16650
    
----
commit 81af45fbcbe9609cd3edaed692cb92520ea3f6e6
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-01T04:01:42Z

    Add test case for killExecutorsOnHost

commit 33ac3643a799eaf3a4b8a48db001eb4aa05c39ef
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-01T04:36:26Z

    BlacklistTracker can ask the SparkContext to kill executors on a host. 
Still need to wire in configuration.

commit da1d91df24310bdc0a748466f1bde746c080ea6f
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T17:14:12Z

    Respond to review feedback: basic changes

commit 87bb328f13c9c49c9c0d210394236015aa068690
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T21:13:07Z

    Add documentation for configuration.md

commit 974999c314be2b7b96a0643cb1f20de42210d29a
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T22:33:13Z

    First implementation of actual executor killing in BlacklistTracker

commit ebe35f6fc356acc15edb2a0fa1284ed3976481da
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T23:14:35Z

    Additional updates. Not sure if this killing is thread or race safe.

commit 56b5b96fc65220604495d5c4e817cfc5071efe22
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T23:25:21Z

    Add some implementation thoughts in comments to BlacklistTracker

commit c4556bd6680b393ffb949bc5b321e38209d91d37
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-13T03:30:11Z

    Update killing of nodes to use an RPC method for synchronization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
