GitHub user jsoltren opened a pull request:
https://github.com/apache/spark/pull/16650
[SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted
## What changes were proposed in this pull request?
In SPARK-8425, we introduced a mechanism for blacklisting executors and
nodes (hosts). After a certain number of failures, these resources would be
"blacklisted" and no further work would be assigned to them for some period of
time.
In some scenarios, it is better to fail fast and simply kill these
unreliable resources. This change proposes to do so by having the
BlacklistTracker kill unreliable resources when they would otherwise be
"blacklisted".
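To make the proposal concrete, here is a minimal sketch of the idea, not the code in this patch. `SimpleBlacklistTracker`, `ExecutorKiller`, `maxFailures` and `killOnBlacklist` are hypothetical names chosen for illustration; the real BlacklistTracker also tracks nodes and expires entries after a timeout, both of which are omitted here.

```scala
import scala.collection.mutable

// Hypothetical stand-in for whatever component can actually kill an executor.
trait ExecutorKiller {
  def killExecutor(executorId: String): Unit
}

// Hypothetical, simplified tracker. It is NOT Spark's BlacklistTracker:
// it handles executors only and never expires a blacklist entry.
final class SimpleBlacklistTracker(
    maxFailures: Int,
    killOnBlacklist: Boolean,
    killer: ExecutorKiller) {

  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val blacklisted = mutable.Set.empty[String]

  // Record one failure; blacklist (and optionally kill) once the threshold is hit.
  def recordFailure(executorId: String): Unit = {
    failures(executorId) += 1
    // Set.add returns true only the first time, so the kill is requested once.
    if (failures(executorId) >= maxFailures && blacklisted.add(executorId)) {
      if (killOnBlacklist) killer.killExecutor(executorId)
    }
  }

  def isBlacklisted(executorId: String): Boolean = blacklisted.contains(executorId)
}
```

With `killOnBlacklist` set to false this behaves like plain blacklisting; setting it to true adds the fail-fast behavior described above.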
To be thread safe, this code depends on the CoarseGrainedSchedulerBackend
sending a message to the driver backend, which does the actual killing. This
also helps to prevent a race in which work could begin on a resource (executor
or node) between the time the resource is marked for killing and the time at
which it is finally killed.
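The synchronization idea can be sketched as follows, under the assumption that the driver backend handles its messages one at a time on a single thread. This is an illustration only and does not use Spark's RPC layer; `SingleThreadedEndpoint`, `LaunchTask` and `launchedTasks` are hypothetical names, and `KillExecutorsOnHost` is used here simply as a message name for the example.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.mutable

sealed trait Message
final case class KillExecutorsOnHost(host: String) extends Message
final case class LaunchTask(host: String, taskId: Int) extends Message

// Hypothetical endpoint: every message is handled on one thread, in send order.
final class SingleThreadedEndpoint {
  private val loop = Executors.newSingleThreadExecutor()

  // State below is only ever touched from the loop thread.
  private val deadHosts = mutable.Set.empty[String]
  private val launched = mutable.ListBuffer.empty[Int]

  def send(msg: Message): Unit =
    loop.execute(new Runnable { def run(): Unit = receive(msg) })

  private def receive(msg: Message): Unit = msg match {
    case KillExecutorsOnHost(host) =>
      deadHosts += host
      ()
    case LaunchTask(host, taskId) =>
      // A launch queued after the kill message can never slip through.
      if (!deadHosts.contains(host)) launched += taskId
      ()
  }

  // Reading state is itself a task on the loop, so it sees all earlier messages.
  def launchedTasks(): List[Int] =
    loop.submit(new Callable[List[Int]] { def call(): List[Int] = launched.toList }).get()

  def stop(): Unit = loop.shutdown()
}
```

Because marking a host dead and deciding whether to launch on it happen on the same thread, there is no window between "marked for killing" and "killed" in which new work can be assigned.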
## How was this patch tested?
./dev/run-tests
Ran
https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh,
and checked logs to see executors and nodes being killed.
Testing can likely be improved here; suggestions welcome.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jsoltren/spark SPARK-16554-submit
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16650.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16650
----
commit 81af45fbcbe9609cd3edaed692cb92520ea3f6e6
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-01T04:01:42Z
Add test case for killExecutorsOnHost
commit 33ac3643a799eaf3a4b8a48db001eb4aa05c39ef
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-01T04:36:26Z
BlacklistTracker can ask the SparkContext to kill executors on a host.
Still need to wire in configuration.
commit da1d91df24310bdc0a748466f1bde746c080ea6f
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-02T17:14:12Z
Respond to review feedback: basic changes
commit 87bb328f13c9c49c9c0d210394236015aa068690
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-02T21:13:07Z
Add documentation for configuration.md
commit 974999c314be2b7b96a0643cb1f20de42210d29a
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-02T22:33:13Z
First implementation of actual executor killing in BlacklistTracker
commit ebe35f6fc356acc15edb2a0fa1284ed3976481da
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-02T23:14:35Z
Additional updates. Not sure if this killing is thread or race safe.
commit 56b5b96fc65220604495d5c4e817cfc5071efe22
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-02T23:25:21Z
Add some implementation thoughts in comments to BlacklistTracker
commit c4556bd6680b393ffb949bc5b321e38209d91d37
Author: José Hiram Soltren <[email protected]>
Date: 2016-12-13T03:30:11Z
Update killing of nodes to use an RPC method for synchronization
----