GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/15249
[SPARK-17675] [CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?
This is a step along the way to SPARK-8425.
To enable incremental review, the first step proposed here is to expand the
blacklisting within tasksets. In particular, this will enable blacklisting for:
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)
Adding (task, node) is critical to making Spark tolerant of a single bad
disk in a cluster, without requiring careful tuning of
"spark.task.maxFailures". The other additions are also important to avoid
many misleading task failures and long scheduling delays when there is one bad
node on a large cluster.
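As a rough mental model of the four levels listed above, here is a minimal sketch in Python. It is not the code in this patch: the class name, method names, and threshold defaults are all hypothetical and chosen only for illustration. It assumes each level is governed by a simple failure-count threshold within a single taskset.

```python
from collections import defaultdict


class TaskSetBlacklistSketch:
    """Illustrative per-taskset failure bookkeeping (hypothetical, not Spark's class)."""

    def __init__(self,
                 max_task_attempts_per_executor=1,   # hypothetical default
                 max_task_attempts_per_node=2,       # hypothetical default
                 max_failed_tasks_per_executor=2,    # hypothetical default
                 max_failed_executors_per_node=2):   # hypothetical default
        self.max_task_attempts_per_executor = max_task_attempts_per_executor
        self.max_task_attempts_per_node = max_task_attempts_per_node
        self.max_failed_tasks_per_executor = max_failed_tasks_per_executor
        self.max_failed_executors_per_node = max_failed_executors_per_node
        # executor id -> {task index -> number of failed attempts}
        self.failures_by_executor = defaultdict(lambda: defaultdict(int))
        # node name -> executor ids on that node that have seen a failure
        self.executors_by_node = defaultdict(set)

    def record_failure(self, node, executor, task):
        """Record one failed attempt of `task` on `executor`, which runs on `node`."""
        self.failures_by_executor[executor][task] += 1
        self.executors_by_node[node].add(executor)

    def _attempts(self, executor, task):
        # Use .get so that queries never create empty entries.
        return self.failures_by_executor.get(executor, {}).get(task, 0)

    def task_blocked_on_executor(self, executor, task):
        """(task, executor): this task failed too often on this executor."""
        return self._attempts(executor, task) >= self.max_task_attempts_per_executor

    def task_blocked_on_node(self, node, task):
        """(task, node): this task failed too often across executors on this node."""
        total = sum(self._attempts(e, task)
                    for e in self.executors_by_node.get(node, ()))
        return total >= self.max_task_attempts_per_node

    def executor_blocked_for_taskset(self, executor):
        """(taskset, executor): too many distinct tasks failed on this executor."""
        distinct_failed = len(self.failures_by_executor.get(executor, {}))
        return distinct_failed >= self.max_failed_tasks_per_executor

    def node_blocked_for_taskset(self, node):
        """(taskset, node): too many executors on this node are blocked."""
        blocked = [e for e in self.executors_by_node.get(node, ())
                   if self.executor_blocked_for_taskset(e)]
        return len(blocked) >= self.max_failed_executors_per_node
```

With these illustrative thresholds, a task that fails once on each of two executors sharing a host with a bad disk would no longer be offered to that host, while other tasks could still run there until the taskset-level thresholds are reached.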
Note that some of the code changes here aren't really required for just this
change -- they put pieces in place for SPARK-8425 even though they are not used
yet (e.g. the `BlacklistTracker` helper is a little out of place, and
`TaskSetBlacklist` holds onto a little more info than it needs to for just this
change).
## How was this patch tested?
Added unit tests and ran the tests via Jenkins.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark taskset_blacklist_only
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15249.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15249
----
commit 9a6aaede723b206a95a616ae740dcc5ed6b74cbc
Author: mwws <[email protected]>
Date: 2015-12-29T06:01:17Z
enhance blacklist mechanism
1. create new BlacklistTracker and BlacklistStrategy interface to support
complex use case for blacklist mechanism.
2. make Yarn allocator aware of node blacklist information
3. three strategies implemented for convenience, also user can define
his own strategy
SingleTaskStrategy: remain default behavior before this change.
AdvanceSingleTaskStrategy: enhance SingleTaskStrategy by supporting
stage level node blacklist
ExecutorAndNodeStrategy: different taskSet can share blacklist
information.
commit 5bfe94106f8c49f2c3d60d13bd2bb6389be94e4f
Author: Imran Rashid <[email protected]>
Date: 2016-05-10T17:49:05Z
Update for new design
commit d7adc6711e15994cefc280666486f54efb0f95cd
Author: Imran Rashid <[email protected]>
Date: 2016-07-07T20:32:51Z
(node,task) blacklisting
commit a34e9aeb695958c749d306595d1adebe0207fdf9
Author: Imran Rashid <[email protected]>
Date: 2016-07-07T20:52:43Z
go back to having the blacklist tracker as an option, rather than the no-op
implementation
commit cf5837410818dae093ef15617cb42336a14408db
Author: Imran Rashid <[email protected]>
Date: 2016-07-07T21:32:09Z
dont count shuffle-fetch failures
commit 7fcb266f7816de8127807017a755d518ab5d314e
Author: Imran Rashid <[email protected]>
Date: 2016-07-08T16:57:42Z
make sure we clear the (node, task) blacklist on stage completion, add to
test
commit 487eb66f850ba6cf68cc814d03b7daa085f591b4
Author: Imran Rashid <[email protected]>
Date: 2016-07-11T21:42:15Z
review feedback
commit c22aaad76f07cbe58ea455d18959470e7afb1498
Author: Imran Rashid <[email protected]>
Date: 2016-07-12T01:55:39Z
fix
commit dc2b3ed4cc256fdb588947bf293d7517c754c45b
Author: Imran Rashid <[email protected]>
Date: 2016-07-13T11:28:49Z
all taskset specific blacklisting is now in TaskSetManager
commit 338db65e75c57bdc4af34ee342bc4ab843468b90
Author: Imran Rashid <[email protected]>
Date: 2016-07-13T11:31:03Z
fix
commit fa3e34aceb5a20aa7a0324b50d55e71e9a50ae6d
Author: Imran Rashid <[email protected]>
Date: 2016-07-13T16:30:28Z
Merge branch 'master' into blacklist-SPARK-8425
commit 16afb431f3827d10c622d0f8ec436a8c01c961fb
Author: Imran Rashid <[email protected]>
Date: 2016-07-14T15:58:39Z
review feedback
commit 7aff08a52099334eb6ac242c7ba70a9873aef624
Author: Imran Rashid <[email protected]>
Date: 2016-07-14T21:22:57Z
review feedback
commit e181546bbcce249ace26c952666493a187f95437
Author: Imran Rashid <[email protected]>
Date: 2016-07-14T21:26:54Z
rename conf
commit 351a9a7e2893a0b90c57233d5e44a52c147bb2a8
Author: Imran Rashid <[email protected]>
Date: 2016-07-14T21:45:24Z
use typed confs consistently
commit 572c7773c44010428e4c1b39a4b03fdbf103a3d6
Author: Imran Rashid <[email protected]>
Date: 2016-07-20T17:11:15Z
docs
commit 8cebb01a0ad19eeccd4143557b7cf22836041c70
Author: Imran Rashid <[email protected]>
Date: 2016-07-20T17:35:04Z
api simplification
commit dbf904e80ad892b76d29bf0092db5810d14b3271
Author: Imran Rashid <[email protected]>
Date: 2016-07-20T18:49:07Z
review feedback
commit f0de0db8c54ac2dc66b0ab59b1b6c155a6775014
Author: Imran Rashid <[email protected]>
Date: 2016-07-20T21:02:14Z
fix for config name change
commit 8a12adf445b00e8841eb3df071c0b6adee6c16da
Author: Imran Rashid <[email protected]>
Date: 2016-07-22T17:16:46Z
exclude killed tasks and preempted tasks from blacklist
commit c9e3662dd468fe262fcb663ca1898cac0029f8d7
Author: Imran Rashid <[email protected]>
Date: 2016-07-26T16:20:58Z
combine imports
commit 497e6268b46549502b615b487ebf26361ecf4623
Author: Imran Rashid <[email protected]>
Date: 2016-08-11T21:26:16Z
review feedback
commit 515b18a6d15b1385caf6e79a849e3397e37e8602
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T15:16:36Z
add task timeouts
commit f0428b46e7de9837fa30850be85b7fb482191a48
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T15:24:47Z
separate datastructure to track blacklisted execs in a tsm, to simplify code
commit a5fbce754a4ec8371a85439f88f87ddc15229fc2
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T15:33:03Z
Merge branch 'master' into blacklist-SPARK-8425
commit b582d8e84fc17d908b43b91ad819acfc5b4b79fc
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T15:34:49Z
fix missing import
commit cec36c93c16d3b3f02a4720649c187595471663d
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T15:47:24Z
fix line wrapping
commit 290b3154d8cc179f6f18742ce53f50861fcc2c9c
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T20:06:16Z
fix test by turning off blacklist
commit 8c58ad96db948bec95f2e99fb9298db481e60b58
Author: Imran Rashid <[email protected]>
Date: 2016-08-18T22:04:48Z
unused import
commit f01278038b5ee5cfdb80de9b254ae9fab0bd5f52
Author: Imran Rashid <[email protected]>
Date: 2016-08-22T15:34:32Z
review feedback
----