GitHub user GavinGavinNo1 opened a pull request:
https://github.com/apache/spark/pull/16855
[SPARK-13931] Resolve stage hanging up problem in a particular case
## What changes were proposed in this pull request?
When 'executorLost' is invoked in class 'TaskSetManager', it needs to check
whether the variable 'isZombie' is set to true.
This pull request fixes the following hang:
1. Enable speculation in the application.
2. Run the application and suppose the last task of shuffleMapStage 1
finishes. To be clear: from the DAGScheduler's point of view the stage has
finished, and from the TaskSetManager's point of view 'isZombie' is set to
true, but runningTasksSet is not empty because speculative tasks are still
running.
3. Suddenly, executor 3 is lost. On receiving this signal, the TaskScheduler
invokes executorLost on every TaskSetManager in rootPool, and the DAGScheduler
removes all of this executor's outputLocs.
4. The TaskSetManager adds all of this executor's tasks to pendingTasks and
tells the DAGScheduler that they will be resubmitted (note: this notification
may arrive late).
5. The DAGScheduler starts to submit a new waiting stage, say shuffleMapStage 2,
and finds that shuffleMapStage 1 is a missing parent because some outputLocs
were removed when the executor was lost. The DAGScheduler then submits
shuffleMapStage 1 again.
6. The DAGScheduler still receives 'Resubmitted' signals from the old
TaskSetManager and increases the number of pendingTasks of shuffleMapStage 1
each time. However, the old TaskSetManager will not launch any new tasks
because its 'isZombie' is set to true.
7. As a result, shuffleMapStage 1 never finishes in the DAGScheduler, and
neither do the stages that depend on it.
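
The bookkeeping problem in steps 3 to 7 can be replayed with a small toy
model. This is only a sketch and not Spark code: `ToyStage`,
`ToyTaskSetManager`, `ZombieResubmitDemo`, and the `checkZombie` flag are
hypothetical names invented for illustration, and the single counter stands in
for the stage's pending tasks. It assumes that the proposed change amounts to
skipping the 'Resubmitted' notification when the manager is already a zombie.

```scala
// Toy model only: hypothetical stand-ins, not Spark's DAGScheduler or TaskSetManager.
final class ToyStage {
  var pendingTasks: Int = 0 // tasks the scheduler still expects to finish
  def isFinished: Boolean = pendingTasks == 0
}

final class ToyTaskSetManager(finishedOn: Map[Int, String], val isZombie: Boolean) {
  // Returns the partitions for which a "Resubmitted" event would be reported.
  // With checkZombie = true, a zombie manager reports nothing.
  def executorLost(executor: String, checkZombie: Boolean): Seq[Int] =
    if (checkZombie && isZombie) Seq.empty
    else finishedOn.collect { case (partition, exec) if exec == executor => partition }.toSeq.sorted
}

object ZombieResubmitDemo {
  // Replays steps 3 to 7 and returns whether stage 1 finishes.
  def stageFinishes(checkZombie: Boolean): Boolean = {
    val stage1 = new ToyStage
    // Old attempt: every partition has succeeded, so the manager is a zombie.
    val oldManager =
      new ToyTaskSetManager(Map(0 -> "executor-2", 1 -> "executor-3"), isZombie = true)

    // Steps 3-4: executor 3 is lost; the Resubmitted events are delivered late.
    val lateResubmitted = oldManager.executorLost("executor-3", checkZombie)

    // Step 5: stage 1 is resubmitted for the one partition whose output was lost.
    stage1.pendingTasks = 1

    // Step 6: each late event from the old manager increases the pending count.
    lateResubmitted.foreach(_ => stage1.pendingTasks += 1)

    // The new attempt completes the single task it actually launched.
    stage1.pendingTasks -= 1

    stage1.isFinished
  }

  def main(args: Array[String]): Unit = {
    println(s"without isZombie check: finished = ${stageFinishes(checkZombie = false)}")
    println(s"with isZombie check:    finished = ${stageFinishes(checkZombie = true)}")
  }
}
```

In this model the stage stays unfinished without the check, because the late
event raises the pending count above the number of tasks the new attempt will
ever run. With the check, the count returns to zero.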
## How was this patch tested?
It's quite difficult to construct test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/GavinGavinNo1/spark resolve-stage-blocked2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16855.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16855
----
commit e15b2abedb6fcaf6bac8775f15bdd246fa22902e
Author: GavinGavinNo1
Date: 2017-02-08T14:51:59Z
Resolve stage hanging up problem in a particular case
----