GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/13234
[WIP] [SPARK-8426] Enhance Blacklist mechanism for fault-tolerance
## What changes were proposed in this pull request?
Update of https://github.com/apache/spark/pull/8760 by @mwws. The current
blacklist mechanism only considers one task at a time. This change expands
it in two ways:
1. When we determine an executor is bad, we blacklist *all* tasks from that
executor, both within the current task set and in subsequent task sets.
2. When many executors on a node appear to be bad, we blacklist the entire
node.
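
To make the two rules above concrete, here is a minimal sketch of the bookkeeping they imply. It is not the code in this patch: the class name `SimpleBlacklistTracker`, its methods, and the two thresholds are hypothetical, chosen only for illustration, and the sketch ignores timeouts and concurrency.

```scala
import scala.collection.mutable

// Hypothetical sketch only: names and thresholds are illustrative and are
// not taken from the BlacklistTracker in this pull request.
final class SimpleBlacklistTracker(
    maxFailuresPerExecutor: Int,
    maxBadExecutorsPerNode: Int) {

  // Number of task failures observed on each executor.
  private val failuresByExecutor = mutable.Map.empty[String, Int]
  // Executors already judged bad, grouped by the host they run on.
  private val badExecutorsByNode = mutable.Map.empty[String, mutable.Set[String]]
  // Hosts on which enough executors have been judged bad.
  private val blacklistedNodes = mutable.Set.empty[String]

  def recordFailure(executorId: String, host: String): Unit = {
    val count = failuresByExecutor.getOrElse(executorId, 0) + 1
    failuresByExecutor(executorId) = count
    if (count >= maxFailuresPerExecutor) {
      // Rule 1: the executor is bad, so no task should be scheduled on it.
      val bad = badExecutorsByNode.getOrElseUpdate(host, mutable.Set.empty[String])
      bad += executorId
      // Rule 2: enough bad executors on one host blacklists the whole node.
      if (bad.size >= maxBadExecutorsPerNode) blacklistedNodes += host
    }
  }

  def isExecutorBlacklisted(executorId: String): Boolean =
    badExecutorsByNode.values.exists(_.contains(executorId))

  def isNodeBlacklisted(host: String): Boolean =
    blacklistedNodes.contains(host)

  // A task may run on an executor only if neither it nor its host is blacklisted.
  def canSchedule(executorId: String, host: String): Boolean =
    !isNodeBlacklisted(host) && !isExecutorBlacklisted(executorId)
}
```

With both thresholds set to 2, two failures on one executor exclude only that executor, while a second bad executor on the same host excludes every executor on that host, including ones that have never failed.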
## How was this patch tested?
Unit tests via Jenkins.
I also ran the additional tests proposed
[here](https://github.com/apache/spark/pull/8559), which include blacklist
tests.
TODO:
[ ] performance tests
[ ] more internal comments (in particular on concurrency)
[ ] manual testing on a cluster
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark blacklist-SPARK-8426
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13234.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13234
----
commit 975a2a3c2b810f6b462eb46813075aac4928c0ae
Author: mwws <[email protected]>
Date: 2015-12-29T06:01:17Z
enhance blacklist mechanism
1. create new BlacklistTracker and BlacklistStrategy interfaces to support
complex use cases for the blacklist mechanism.
2. make Yarn allocator aware of node blacklist information
3. three strategies implemented for convenience; users can also define
their own strategy
SingleTaskStrategy: retains the default behavior from before this change.
AdvanceSingleTaskStrategy: enhances SingleTaskStrategy by supporting
stage-level node blacklisting
ExecutorAndNodeStrategy: different task sets can share blacklist
information.
commit 51d3c88720faffd6a1fb6910b999cdce0d446bcf
Author: mwws <[email protected]>
Date: 2016-01-13T05:43:46Z
change import order to meet new scala style check rule
commit 7e52311bcf4b5528d127d1d0a16bade7c039517e
Author: mwws <[email protected]>
Date: 2016-02-23T05:28:56Z
simplify code and fix typo
1. fix compile error after rebase to latest codebase.
2. simplify configuration.
3. fix typo.
4. enhance comment and unit test.
5. remove unused import.
6. remove ExecutorAndNode strategy.
commit b600604a0920054cf3b33bff047d84cbd302fb3c
Author: Imran Rashid <[email protected]>
Date: 2016-05-10T17:49:05Z
style
commit 45525a118db078f80b3e0e74abe7d7f2e04a7883
Author: Imran Rashid <[email protected]>
Date: 2016-05-10T19:27:39Z
small refactoring
commit f6bb6de673cae7058c26d2f124d3de0d2eb5b06b
Author: Imran Rashid <[email protected]>
Date: 2016-05-20T21:09:13Z
Merge branch 'master' into blacklist-SPARK-8426
----