[
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909990#comment-14909990
]
SuYan edited comment on SPARK-10796 at 9/28/15 3:00 AM:
--------------------------------------------------------
Running Stage 0.0 with TaskSet 0.0: Task 0.0 finished on ExecA, Task 1.0
running on ExecB, Task 2.0 waiting.
---> Task 1.0 throws FetchFailedException.
---> The stage is resubmitted as Stage 0.1 with TaskSet 0.1 (which re-runs
Task 1 and Task 2); assume Task 1.0 finishes on ExecA.
---> ExecA is lost, and it happens that no task throws FetchFailedException.
---> TaskSet 0.1 resubmits Task 1: it re-adds it into its pendingTasks and
waits for TaskSchedulerImpl to schedule it.
TaskSet 0.0 also resubmits Task 0 and re-adds it into its pendingTasks, but
because TaskSet 0.0 is zombie, TaskSchedulerImpl skips scheduling it.
So once both TaskSet 0.0 and TaskSet 0.1 satisfy (isZombie &&
runningTasks.isEmpty), TaskSchedulerImpl removes those TaskSets.
The DAGScheduler still has pendingPartitions, due to the task lost in
TaskSet 0.0, but all of the stage's TaskSets have been removed, so it hangs...
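
Below is a minimal, self-contained sketch of that final state. It is plain
Scala with made-up names (TaskSetModel and its fields only stand in for
TaskSetManager's isZombie / runningTasks / pendingTasks), not Spark's actual
classes:

{code:scala}
// Illustrative model only: Spark's real TaskSetManager holds far more state.
case class TaskSetModel(
    name: String,
    isZombie: Boolean,
    runningTasks: Set[Int],
    pendingTasks: Set[Int])

object HangScenario extends App {
  // State after ExecA is lost and TaskSet 0.1 has finished Task 1 and Task 2:
  // TaskSet 0.0 is zombie, so its re-added Task 0 is never scheduled again.
  val taskSet00 = TaskSetModel("TaskSet 0.0", isZombie = true,
    runningTasks = Set.empty, pendingTasks = Set(0))
  val taskSet01 = TaskSetModel("TaskSet 0.1", isZombie = true,
    runningTasks = Set.empty, pendingTasks = Set.empty)

  // TaskSchedulerImpl removes every TaskSet with (isZombie && runningTasks.isEmpty):
  val remaining = Seq(taskSet00, taskSet01)
    .filterNot(ts => ts.isZombie && ts.runningTasks.isEmpty)

  // The DAGScheduler still waits for partition 0 (Task 0's lost map output),
  // and no FetchFailedException was thrown, so the stage is never resubmitted.
  val pendingPartitions = Set(0)
  println(s"TaskSets left: $remaining, pendingPartitions: $pendingPartitions")
  // -> TaskSets left: List(), pendingPartitions: Set(0) -- nothing can finish it.
}
{code}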
> The stage's TaskSets may all be removed while the stage still has pending
> partitions after having lost some executors
> -----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.3.0
> Reporter: SuYan
> Priority: Minor
>
> We met this problem in Spark 1.3.0; I have also checked the latest Spark
> code, and I think the problem still exists.
> 1. A running *ShuffleMapStage* may have multiple *TaskSet*s: one active
> TaskSet and multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* succeeds only once all of its partitions have
> been processed successfully, namely once each task's *MapStatus* has been
> added into *outputLocs*.
> 3. The *MapStatus*es of a running *ShuffleMapStage* may come from any of
> zombie TaskSet 1 / zombie TaskSet 2 / ... / active TaskSet N, and some
> *MapStatus*es may belong to only one TaskSet, possibly a zombie one.
> 4. If an executor is lost, it can happen that some of the *MapStatus*es lost
> with it were produced by a zombie TaskSet. In the current logic, a lost
> *MapStatus* is recovered by having each *TaskSet* re-run the tasks that
> succeeded on the lost executor: the tasks are re-added into the *TaskSet*'s
> *pendingTasks*, and their partitions are re-added into the *Stage*'s
> *pendingPartitions*. But this is useless if the lost *MapStatus* belongs
> only to a *zombie TaskSet*: being zombie, its *pendingTasks* are never
> scheduled (see the sketch at the end of this description).
> 5. A stage is resubmitted only when some task throws
> *FetchFailedException*, but the lost executor may happen to hold no
> *MapStatus* of any parent stage of the running stages (so no running task
> ever hits a fetch failure), while the running stage itself lost a
> *MapStatus* that belongs only to a *zombie TaskSet*. So once all zombie
> TaskSets have finished their runningTasks and the active TaskSet has
> processed all of its pendingTasks, they are all removed by
> *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* is still
> non-empty: it hangs...
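>
> A short sketch of point 4, again in plain Scala with made-up names (not the
> real TaskSetManager / TaskSchedulerImpl API): re-adding a task into
> pendingTasks works for an active TaskSet, but the offer loop never looks at
> a zombie one:
>
> {code:scala}
> import scala.collection.mutable
>
> // Made-up stand-ins for TaskSetManager and the resource-offer loop.
> final class TaskSetModel(val name: String, val isZombie: Boolean) {
>   val pendingTasks: mutable.Queue[Int] = mutable.Queue.empty
> }
>
> object ZombiePendingTasks extends App {
>   // On executor loss, the task for each lost MapStatus is re-added,
>   // no matter which TaskSet produced it:
>   def onExecutorLost(ts: TaskSetModel, lostPartitions: Seq[Int]): Unit =
>     ts.pendingTasks ++= lostPartitions
>
>   // ...but scheduling only ever considers non-zombie TaskSets:
>   def offerOne(taskSets: Seq[TaskSetModel]): Option[(String, Int)] =
>     taskSets.filterNot(_.isZombie)
>       .find(_.pendingTasks.nonEmpty)
>       .map(ts => (ts.name, ts.pendingTasks.dequeue()))
>
>   val zombie = new TaskSetModel("TaskSet 0.0", isZombie = true)
>   onExecutorLost(zombie, Seq(0))  // partition 0 is pending again...
>   println(offerOne(Seq(zombie)))  // ...prints None: it is never launched
> }
> {code}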