GitHub user jerryshao opened a pull request:
https://github.com/apache/spark/pull/17113
[SPARK-13669][Core]Improve the blacklist mechanism to handle external
shuffle service unavailable situation
## What changes were proposed in this pull request?
We are currently running into an issue when YARN work-preserving restart is
enabled together with the external shuffle service.
With work preserving enabled, a NodeManager (NM) failure does not cause the
executors on that node to exit, so they can still accept and run tasks. The
problem is that when the NM has failed, the external shuffle service is
inaccessible, so reduce tasks always fail with "Fetch failure", and the
failure of the reduce stage causes the parent (map) stage to rerun. The tricky
part is that the Spark scheduler is not aware that the external shuffle
service is unavailable, so it reschedules the map tasks on the executor whose
NM has failed; the reduce stage then fails with "Fetch failure" again, and
after 4 retries the job fails. This could also apply to other cluster managers
that use an external shuffle service.
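The retry loop described above can be illustrated with a toy model. This is only a sketch and not Spark code: the function name, the executor ids, and the "always pick the first executor" placement rule are assumptions made for illustration, and the limit of 4 attempts is taken from the description above.

```python
MAX_STAGE_ATTEMPTS = 4  # the "4 retries" mentioned in the description


def run_job_without_blacklist(executors, shuffle_service_up):
    """Toy model of the failure loop; returns (succeeded, attempts_used).

    executors: executor ids in scheduling preference order (hypothetical).
    shuffle_service_up: maps executor id -> whether the shuffle service
        on that executor's node is reachable.
    """
    for attempt in range(1, MAX_STAGE_ATTEMPTS + 1):
        # The scheduler has no view of shuffle service health, so in this
        # model it keeps placing the map tasks on the same executor.
        map_executor = executors[0]
        if shuffle_service_up[map_executor]:
            return True, attempt
        # Otherwise the reduce tasks hit a fetch failure and the map stage
        # is resubmitted on the next loop iteration.
    return False, MAX_STAGE_ATTEMPTS


result = run_job_without_blacklist(
    ["exec-1", "exec-2"],
    {"exec-1": False, "exec-2": True},
)
print(result)
```

In this model the job fails even though `exec-2` is healthy, because nothing steers the rerun map tasks away from `exec-1`.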
The main problem, then, is that we should avoid assigning tasks to those bad
executors (where the shuffle service is unavailable). Spark's current blacklist
mechanism can blacklist executors/nodes based on failed tasks, but it does not
handle this specific fetch failure scenario. This PR therefore proposes to
improve the current application blacklist mechanism to handle fetch failures
(especially when the external shuffle service is unavailable) by blacklisting
the executors/nodes where shuffle fetch is unavailable.
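A minimal sketch of the proposed behaviour follows. It is not the code in this patch: the class and method names are hypothetical, and the rule of blacklisting the whole node when an external shuffle service is in use (and only the executor otherwise) is an assumption made for illustration.

```python
class FetchFailureBlacklist:
    """Hypothetical tracker that blacklists on shuffle fetch failures."""

    def __init__(self, external_shuffle_service_enabled):
        self.external_shuffle_service_enabled = external_shuffle_service_enabled
        self.blacklisted_executors = set()
        self.blacklisted_nodes = set()

    def on_fetch_failure(self, executor_id, host):
        # Assumption: with an external shuffle service, a fetch failure
        # points at the node-level service, so the whole node is excluded.
        if self.external_shuffle_service_enabled:
            self.blacklisted_nodes.add(host)
        else:
            self.blacklisted_executors.add(executor_id)

    def is_schedulable(self, executor_id, host):
        # A task may be placed only if neither its node nor its executor
        # has been blacklisted.
        return (host not in self.blacklisted_nodes
                and executor_id not in self.blacklisted_executors)


blacklist = FetchFailureBlacklist(external_shuffle_service_enabled=True)
blacklist.on_fetch_failure("exec-1", "node-a")

# Every executor on node-a is now skipped; other nodes are unaffected.
print(blacklist.is_schedulable("exec-1", "node-a"))
print(blacklist.is_schedulable("exec-3", "node-a"))
print(blacklist.is_schedulable("exec-2", "node-b"))
```

With such a tracker consulted before task placement, the rerun map tasks would land on a node whose shuffle service is still reachable instead of repeating the same fetch failure.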
## How was this patch tested?
Unit tests and verification on a small cluster.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-13669
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17113.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17113
----
commit a90b23d894bfd568a054a5752c958c36aa58c79b
Author: jerryshao <[email protected]>
Date: 2017-03-01T05:54:21Z
Improve the blacklist mechanism to handle external shuffle service
unavailable situation
Change-Id: I1c0776ec98866c5294ea4ed5d98793fdcebf44ae
----