GitHub user suyanNone reopened a pull request:
https://github.com/apache/spark/pull/8927
[SPARK-10796][CORE]Resubmit stage while lost task in Zombie TaskSets
We hit this problem in Spark 1.3.0; I also checked the latest Spark
code and I believe the problem still exists.
Description:
1. A running `ShuffleMapStage` can have multiple `TaskSet`s: one active
TaskSet and several zombie TaskSets.
2. A running `ShuffleMapStage` is considered successful only when all of its
partitions have been processed successfully, i.e. every task's MapStatus has
been added to `outputLocs`.
3. The MapStatuses of a running `ShuffleMapStage` may come from any of Zombie
TaskSet1 / Zombie TaskSet2 / ... / Active TaskSetN, and some MapStatus may
belong to only one TaskSet, which may be a zombie TaskSet.
4. When an executor is lost, it can happen that some of the MapStatuses
computed on that executor had succeeded in a zombie TaskSet. In the current
logic, the way a lost MapStatus is recovered is that each TaskSet re-runs the
tasks that succeeded on the lost executor: they are re-added to the TaskSet's
`pendingTasks`, and their partitions are re-added to the Stage's
`pendingPartitions`. But this is useless when the lost MapStatus belongs only
to a zombie TaskSet: since the TaskSet is a zombie, its `pendingTasks` will
never be scheduled again.
5. A stage is only resubmitted when some task throws a
`FetchFailedException`. However, the lost executor may not invalidate any
MapStatus of the parent stages of the currently running stages, and the
running `Stage` may happen to lose a MapStatus that belongs only to a zombie
TaskSet. In that case, once every zombie TaskSet has finished its running
tasks and the active TaskSet has processed all of its pending tasks,
`TaskSchedulerImpl` removes all of those TaskSets, while the running Stage's
pending partitions are still non-empty, so the stage hangs (a simplified
sketch of this is below).
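To make the zombie behaviour concrete, here is a minimal sketch of the three
pieces involved: re-adding tasks on executor loss, skipping zombie TaskSets
during scheduling, and removing zombie TaskSets that have no running tasks.
The class names below (`SimpleTaskSetManager`, `SimpleScheduler`) are
hypothetical stand-ins for `TaskSetManager` / `TaskSchedulerImpl`, reduced to
the fields that matter for this bug; this is not the real Spark code.

```scala
// Minimal sketch of the interaction described in points 1-5 above.
import scala.collection.mutable

class SimpleTaskSetManager(val stageId: Int, val attempt: Int) {
  var isZombie: Boolean = false            // true once a newer stage attempt exists
  val pendingTasks = mutable.Queue[Int]()  // partition ids still waiting to run
  var runningTasks: Int = 0

  // Analogue of the "re-add tasks that succeeded on the lost executor" step:
  // the lost partitions go back into this TaskSet's own pending queue.
  def executorLost(lostPartitions: Seq[Int]): Unit =
    pendingTasks ++= lostPartitions
}

class SimpleScheduler {
  val taskSetManagers = mutable.Buffer[SimpleTaskSetManager]()

  // Analogue of resource offers: zombie TaskSets are skipped entirely, so
  // anything sitting in a zombie's pendingTasks is never launched again.
  def scheduleOne(): Option[(SimpleTaskSetManager, Int)] =
    taskSetManagers.iterator
      .filterNot(_.isZombie)
      .find(_.pendingTasks.nonEmpty)
      .map(tsm => (tsm, tsm.pendingTasks.dequeue()))

  // Analogue of removing a finished TaskSet: a zombie with no running tasks
  // is dropped, regardless of what is still in its pendingTasks.
  def cleanup(): Unit =
    taskSetManagers --= taskSetManagers.filter(t => t.isZombie && t.runningTasks == 0)
}
```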
Example:
Running Stage 0.0 with TaskSet 0.0: Task 0.0 finished on ExecA, Task 1.0
running on ExecB, Task 2.0 waiting.
---> Task 1.0 throws a FetchFailedException.
---> The stage is resubmitted as Stage 0.1 with TaskSet 0.1 (which re-runs
Task 1 and Task 2); assume the re-run of Task 1 finishes on ExecA.
---> ExecA is lost, and it happens that no task throws a FetchFailedException.
---> TaskSet 0.1 resubmits Task 1, re-adds it to its pendingTasks and waits
for TaskSchedulerImpl to schedule it.
TaskSet 0.0 also resubmits Task 0 and re-adds it to its pendingTasks, but
because it is a zombie, TaskSchedulerImpl never schedules TaskSet 0.0.
So once TaskSet 0.0 and TaskSet 0.1 both satisfy (isZombie &&
runningTasks.isEmpty), TaskSchedulerImpl removes both TaskSets.
DAGScheduler still has pendingPartitions because of the task lost in
TaskSet 0.0, but all of its TaskSets have been removed, so the stage hangs.
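Replaying this timeline with the hypothetical classes from the earlier sketch
(again, only an illustration of the described behaviour, not the actual Spark
code) ends with no TaskSetManagers left but partition 0 still pending:

```scala
// Replays the example timeline using SimpleTaskSetManager / SimpleScheduler
// from the sketch above.
import scala.collection.mutable

object HangExample extends App {
  val scheduler = new SimpleScheduler
  // DAGScheduler-side view of Stage 0 just before ExecA is lost:
  // only partition 2 is still pending (partitions 0 and 1 have MapStatuses).
  val pendingPartitions = mutable.Set(2)

  val tsm00 = new SimpleTaskSetManager(stageId = 0, attempt = 0)
  tsm00.isZombie = true                  // zombie since Task 1.0's FetchFailedException
  val tsm01 = new SimpleTaskSetManager(stageId = 0, attempt = 1)
  tsm01.pendingTasks += 2                // Task 2 is still waiting to run
  scheduler.taskSetManagers ++= Seq(tsm00, tsm01)

  // ExecA is lost: partition 0 had succeeded only in the zombie TaskSet 0.0,
  // partition 1 in the active TaskSet 0.1. Both are re-added.
  tsm00.executorLost(Seq(0))
  tsm01.executorLost(Seq(1))
  pendingPartitions ++= Seq(0, 1)

  // Only the non-zombie TaskSet is ever offered resources, so partition 0
  // (which lives only in the zombie's queue) is never re-run.
  var offer = scheduler.scheduleOne()
  while (offer.isDefined) {
    val (tsm, partition) = offer.get
    pendingPartitions -= partition
    println(s"ran partition $partition from stage attempt ${tsm.attempt}")
    offer = scheduler.scheduleOne()
  }

  // TaskSet 0.1 has now finished all of its own tasks, so it too becomes a
  // zombie with no running tasks, and cleanup removes every TaskSetManager.
  tsm01.isZombie = true
  scheduler.cleanup()

  println(s"TaskSetManagers left: ${scheduler.taskSetManagers.size}")  // 0
  println(s"Stage 0 pendingPartitions: $pendingPartitions")            // Set(0) -> never finishes
}
```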
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/suyanNone/spark rerun-special
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8927.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8927
----
commit 554c61f800c6c1b25b1002a7255569a9c38e4154
Author: hushan <[email protected]>
Date: 2015-09-24T09:49:22Z
rerun-specail
commit 3b4a683d23f951082df0b9d29dfa094683d235ea
Author: hushan <[email protected]>
Date: 2015-09-28T03:09:05Z
refine
commit f845f33563623a9f3d6858aba893ed8c75453403
Author: hushan <[email protected]>
Date: 2015-09-28T03:14:18Z
refine
commit 301da0a20c94084bc8f783cd0e087e63f07e2124
Author: hushan <[email protected]>
Date: 2015-09-28T03:18:46Z
refine
----