[
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386561#comment-15386561
]
Thomas Graves commented on SPARK-2666:
--------------------------------------
Thanks for the explanation.
I guess we would have to look through the failure cases, but if you are using
the external shuffle service, it feels like marking everything on that node as
bad, even if it's from another executor, would be better, because this case
seems more like a node failure or something that would be much more likely to
affect other map outputs.
I guess if it's serving shuffle from the executor, it could just be something
bad on that executor (out of memory, timeout due to overload, etc.).
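
The distinction drawn in this comment can be illustrated with a small toy model. This is only a sketch in Python, not Spark's Scala scheduler code: `MapOutput`, `outputs_to_invalidate`, and the `external_shuffle_service` flag are hypothetical names invented for the illustration. It assumes that, with an external shuffle service, a fetch failure casts doubt on every map output stored on that host, whereas without one only the failing executor's outputs are suspect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapOutput:
    # Hypothetical record of where one map task's shuffle output lives.
    map_id: int
    executor_id: str
    host: str


def outputs_to_invalidate(outputs, failed_executor_id, failed_host,
                          external_shuffle_service):
    """Return the map outputs to treat as lost after a fetch failure."""
    if external_shuffle_service:
        # Shuffle files are served by a per-node service, so the failure
        # looks like a node-level problem: distrust the whole host.
        return [o for o in outputs if o.host == failed_host]
    # Shuffle files are served by the executor itself, so the problem may
    # be local to that executor (out of memory, overload, and so on).
    return [o for o in outputs if o.executor_id == failed_executor_id]


outputs = [
    MapOutput(0, "exec-1", "host-a"),
    MapOutput(1, "exec-2", "host-a"),
    MapOutput(2, "exec-3", "host-b"),
]
```

With these sample outputs, a fetch failure against `exec-1` on `host-a` selects map outputs 0 and 1 when the external shuffle service is assumed, and only map output 0 otherwise.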
> Always try to cancel running tasks when a stage is marked as zombie
> -------------------------------------------------------------------
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a
> "zombie" before the task set has completed all of its tasks. For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the
> ShuffleMapOutput, though no attempt has completed all its tasks (at least,
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting
> scheduled; however, it does not cancel all currently running tasks. We should
> cancel all running tasks to avoid wasting resources (and also to make the
> behavior a little clearer to the end user). Rather than canceling tasks in each
> case piecemeal, we should refactor the scheduler so that these two actions
> are always taken together -- canceling tasks should go hand-in-hand with
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it
> behind a method like {{markZombie}} or something.
> * Marking a stage as zombie before all of its tasks have completed does *not*
> necessarily mean the stage attempt has failed. In case (a), the stage
> attempt has failed, but in case (b) we are not canceling b/c of a failure,
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.
> However, it also has some side-effects like logging that the stage has failed
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b)
> when nothing has failed. So it may need some additional refactoring to go
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need
> to be sure to catch the {{UnsupportedOperationException}}s.
> * Testing this *might* benefit from SPARK-10372
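
The implementation notes above can be sketched as a toy model. This is a hedged Python illustration, not Spark's Scala code: the class and method names (`TaskSetManager`, `mark_zombie`, `kill_task`, `KillingBackend`) and the `failed` parameter are assumptions made for the example, and Python's `NotImplementedError` stands in for the JVM's `UnsupportedOperationException`. The idea shown is that the zombie flag is private, that setting it always attempts to cancel running tasks, and that a failure event is recorded only when something actually failed.

```python
class Backend:
    """Hypothetical backend that does not support killing tasks."""

    def kill_task(self, task_id):
        # Stand-in for a backend throwing UnsupportedOperationException.
        raise NotImplementedError("kill_task is not supported")


class KillingBackend(Backend):
    """Hypothetical backend that records each kill request."""

    def __init__(self):
        self.killed = []

    def kill_task(self, task_id):
        self.killed.append(task_id)


class TaskSetManager:
    def __init__(self, backend, running_task_ids):
        self._backend = backend
        self._is_zombie = False  # private: only mark_zombie sets it
        self.running = set(running_task_ids)
        self.events = []

    @property
    def is_zombie(self):
        return self._is_zombie

    def mark_zombie(self, failed, reason=""):
        """Stop scheduling new tasks and try to cancel running ones."""
        self._is_zombie = True
        for task_id in sorted(self.running):
            try:
                self._backend.kill_task(task_id)
            except NotImplementedError:
                # The backend cannot kill tasks; let them run to completion.
                break
        if failed:
            # Case (a): record a failure. Case (b) records nothing,
            # because no more tasks are needed but nothing has failed.
            self.events.append(("TaskSetFailed", reason))
```

Calling `mark_zombie(failed=False)` corresponds to case (b): running tasks are cancelled where the backend allows it, and no failure event is produced. Calling `mark_zombie(failed=True, reason="FetchFailed")` corresponds to case (a).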