[
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385510#comment-15385510
]
Lianhui Wang commented on SPARK-2666:
-------------------------------------
[~tgraves] Sorry for the late reply. In https://github.com/apache/spark/pull/1572,
all running tasks are killed before the stage is resubmitted on a FetchFailed. But
[~kayousterhout] argued for keeping the remaining tasks, because the running
tasks may hit fetch failures from different map outputs than the original fetch
failure.
I think the best approach is what MapReduce does: just resubmit the map stage of
the failed stage. If the reduce stage hits a FetchFailed, it just reports the
FetchFailed to the DAGScheduler and keeps fetching the other results. The reduce
stage then calls getOutputStatus for the FetchFailed output on every heartbeat,
as in https://github.com/apache/spark/pull/3430.
[~tgraves] What are your thoughts on this? Thanks.
> Always try to cancel running tasks when a stage is marked as zombie
> -------------------------------------------------------------------
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a
> "zombie" before the task set has completed all of its tasks. For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the
> ShuffleMapOutput, though no attempt has completed all its tasks (at least,
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (There may be others; I'm not sure this list is exhaustive.)
> Marking a task set as zombie prevents any *additional* tasks from getting
> scheduled; however, it does not cancel the currently running tasks. We should
> cancel all running tasks to avoid wasting resources (and also to make the
> behavior a little clearer to the end user). Rather than canceling tasks
> piecemeal in each case, we should refactor the scheduler so that these two
> actions are always taken together -- canceling tasks should go hand-in-hand
> with marking the task set as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it
> behind a method like {{markZombie}} or something.
> * Marking a stage as zombie before all tasks have completed does *not*
> necessarily mean the stage attempt has failed. In case (a) the stage
> attempt has failed, but in case (b) we are not canceling b/c of a failure,
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.
> However, it also has some side-effects, like logging that the stage has
> failed and creating a {{TaskSetFailed}} event, which we don't want, e.g., in
> case (b) when nothing has failed. So it may need some additional refactoring
> to go along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need
> to be sure to catch the resulting {{UnsupportedOperationException}}s.
> * Testing this *might* benefit from SPARK-10372
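The refactoring sketched in the notes above could look roughly like the following. This is a hypothetical, simplified model, not actual Spark code: the trait and class are stand-ins for Spark's real scheduler classes, and only illustrate the two ideas from the notes -- a {{markZombie}} method that couples zombie-marking with best-effort task cancellation, and swallowing the {{UnsupportedOperationException}} from backends that don't implement {{killTask}}.

```scala
// Hypothetical sketch, not Spark's actual scheduler code.

trait SchedulerBackend {
  // Backends are free to not implement killTask (per the notes above).
  def killTask(taskId: Long, interrupt: Boolean): Unit =
    throw new UnsupportedOperationException("killTask not supported")
}

class TaskSetManager(backend: SchedulerBackend) {
  private var zombie = false                 // formerly the public isZombie flag
  var runningTasks: Set[Long] = Set.empty    // ids of currently running tasks

  def isZombie: Boolean = zombie

  // Marking the task set as zombie and canceling its running tasks
  // always happen together, instead of piecemeal at each call site.
  def markZombie(): Unit = {
    zombie = true
    runningTasks.foreach { tid =>
      try backend.killTask(tid, interrupt = true)
      catch {
        // The backend may not support killing tasks; marking the task
        // set as zombie must still succeed in that case.
        case _: UnsupportedOperationException => ()
      }
    }
  }
}
```

Note that, unlike {{taskScheduler.cancelTasks}}, this sketch deliberately logs no failure and emits no {{TaskSetFailed}} event, so it would also fit case (b), where nothing has actually failed.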
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]