GitHub user jsoltren opened a pull request: https://github.com/apache/spark/pull/16650
[SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted

## What changes were proposed in this pull request?

In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.

In some scenarios, it is better to fail fast and simply kill these unreliable resources. This change proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".

In order to be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend in order to do the actual killing. This also helps to prevent a race which would permit work to begin on a resource (executor or node) between the time the resource is marked for killing and the time at which it is finally killed.

## How was this patch tested?

./dev/run-tests

Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed. Testing can likely be improved here; suggestions welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsoltren/spark SPARK-16554-submit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16650

----
commit 81af45fbcbe9609cd3edaed692cb92520ea3f6e6
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-01T04:01:42Z

    Add test case for killExecutorsOnHost

commit 33ac3643a799eaf3a4b8a48db001eb4aa05c39ef
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-01T04:36:26Z

    BlacklistTracker can ask the SparkContext to kill executors on a host. Still need to wire in configuration.

commit da1d91df24310bdc0a748466f1bde746c080ea6f
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T17:14:12Z

    Respond to review feedback: basic changes

commit 87bb328f13c9c49c9c0d210394236015aa068690
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T21:13:07Z

    Add documentation for configuration.md

commit 974999c314be2b7b96a0643cb1f20de42210d29a
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T22:33:13Z

    First implementation of actual executor killing in BlacklistTracker

commit ebe35f6fc356acc15edb2a0fa1284ed3976481da
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T23:14:35Z

    Additional updates. Not sure if this killing is thread or race safe.

commit 56b5b96fc65220604495d5c4e817cfc5071efe22
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-02T23:25:21Z

    Add some implementation thoughts in comments to BlacklistTracker

commit c4556bd6680b393ffb949bc5b321e38209d91d37
Author: José Hiram Soltren <j...@cloudera.com>
Date:   2016-12-13T03:30:11Z

    Update killing of nodes to use an RPC method for synchronization

----
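
For reviewers who want a feel for the synchronization idea in the PR summary (the tracker decides, but the actual kill is carried out by sending a message to a single driver-side handler), here is a minimal Python sketch. It is not the Spark implementation: `DriverEndpoint`, `FailureTracker`, the message names, and the threshold of 2 are all hypothetical, chosen only to illustrate why funnelling kill requests through one serial message loop avoids concurrent mutation of executor state.

```python
import queue
import threading
from collections import defaultdict

MAX_FAILURES_PER_EXECUTOR = 2  # hypothetical threshold, not a Spark default


class DriverEndpoint:
    """Hypothetical stand-in for the driver-side handler.

    A single thread drains the inbox, so kill requests are applied one at a
    time and never interleave with each other.
    """

    def __init__(self, executors_by_host):
        self._inbox = queue.Queue()
        self._executors_by_host = {h: set(e) for h, e in executors_by_host.items()}
        self.killed = []  # executor ids, in the order they were killed
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def send(self, kind, target):
        # Callers never touch executor state directly; they only enqueue.
        self._inbox.put((kind, target))

    def stop(self):
        # Messages are handled in FIFO order, so everything sent before
        # stop() has been processed by the time join() returns.
        self._inbox.put(("stop", None))
        self._worker.join()

    def _loop(self):
        while True:
            kind, target = self._inbox.get()
            if kind == "stop":
                return
            if kind == "kill_executor":
                self._kill(target)
            elif kind == "kill_host":
                # sorted() copies the set, so removing while looping is safe.
                for executor_id in sorted(self._executors_by_host.get(target, ())):
                    self._kill(executor_id)

    def _kill(self, executor_id):
        for executors in self._executors_by_host.values():
            if executor_id in executors:
                executors.remove(executor_id)
                self.killed.append(executor_id)
                return


class FailureTracker:
    """Hypothetical tracker: counts failures and requests a kill when an
    executor would otherwise only be blacklisted."""

    def __init__(self, endpoint, kill_enabled=True):
        self._endpoint = endpoint
        self._kill_enabled = kill_enabled
        self._failures = defaultdict(int)
        self.blacklisted = set()

    def record_failure(self, executor_id):
        self._failures[executor_id] += 1
        over_limit = self._failures[executor_id] >= MAX_FAILURES_PER_EXECUTOR
        if over_limit and executor_id not in self.blacklisted:
            self.blacklisted.add(executor_id)
            if self._kill_enabled:
                self._endpoint.send("kill_executor", executor_id)


endpoint = DriverEndpoint({"host-a": ["exec-1", "exec-2"], "host-b": ["exec-3"]})
tracker = FailureTracker(endpoint)
tracker.record_failure("exec-1")
tracker.record_failure("exec-1")      # second failure crosses the threshold
endpoint.send("kill_host", "host-b")  # node-level kill takes out every executor on the host
endpoint.stop()
print(endpoint.killed)
```

Because the tracker only enqueues a request, it needs no lock around the executor bookkeeping; whether the real patch gets equivalent guarantees from the scheduler backend's message handling is exactly the question the later commits in this PR are working through.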