[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902459#comment-14902459 ]

Lianhui Wang commented on SPARK-2666:
-------------------------------------

[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, and I think its logic is right. It is OK except for the unit test.

> Always try to cancel running tasks when a stage is marked as zombie
> -------------------------------------------------------------------
>
>                 Key: SPARK-2666
>                 URL: https://issues.apache.org/jira/browse/SPARK-2666
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>            Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a
> "zombie" before the task set has completed all of its tasks. For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the
> ShuffleMapOutput, though no attempt has completed all its tasks (at least,
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting
> scheduled, however it does not cancel all currently running tasks. We should
> cancel all running tasks to avoid wasting resources (and also to make the
> behavior a little more clear to the end user). Rather than canceling tasks in
> each case piecemeal, we should refactor the scheduler so that these two
> actions are always taken together -- canceling tasks should go hand-in-hand
> with marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it
> behind a method like {{markZombie}} or something.
> * Marking a stage as zombie before all of its tasks have completed does *not*
> necessarily mean the stage attempt has failed. In case (a), the stage
> attempt has failed, but in case (b) we are not canceling b/c of a failure,
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.
> However, it also has some side-effects like logging that the stage has failed
> and creating a {{TaskSetFailed}} event, which we don't want, e.g. in case (b)
> when nothing has failed. So it may need some additional refactoring to go
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need
> to be sure to catch the {{UnsupportedOperationException}}s.
> * Testing this *might* benefit from SPARK-10372
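
To make the implementation notes above concrete, here is a minimal, self-contained sketch of what a {{markZombie}}-style method could look like. It is a toy model, not Spark's scheduler code and not the change in the linked pull request: {{Backend}}, {{TaskSetSketch}}, {{ZombieReason}}, {{ZombieResult}} and {{taskStarted}} are hypothetical names made up for illustration. It only shows the three ideas from the notes: the zombie flag is private, marking zombie and killing running tasks happen together, and a backend that does not support {{killTask}} is tolerated.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a scheduler backend; only killTask is modelled.
// As in the notes above, a backend is free to not implement it.
trait Backend {
  def killTask(taskId: Long): Unit =
    throw new UnsupportedOperationException("killTask is not supported")
}

// Why the task set is being retired. Only AttemptFailed is a real failure.
sealed trait ZombieReason
case object AttemptFailed extends ZombieReason      // case (a), e.g. FetchFailed
case object NoMoreTasksNeeded extends ZombieReason  // case (b), outputs already exist

// What the caller learns: which kills were requested, and whether a
// failure should be reported (assumed here to be case (a) only).
case class ZombieResult(killRequested: Seq[Long], reportFailure: Boolean)

class TaskSetSketch(backend: Backend) {
  private var zombie = false                      // no longer settable from outside
  private val running = mutable.Set.empty[Long]

  def isZombie: Boolean = zombie

  // Returns false when the task set is a zombie and refuses new tasks.
  def taskStarted(taskId: Long): Boolean =
    if (zombie) false else { running += taskId; true }

  // Marking zombie and cancelling running tasks always happen together.
  def markZombie(reason: ZombieReason): ZombieResult = {
    zombie = true
    val killRequested = running.toSeq.sorted.filter { id =>
      try { backend.killTask(id); true }
      catch { case _: UnsupportedOperationException => false }
    }
    // Simplification: a requested kill is treated as immediate here.
    running --= killRequested
    ZombieResult(killRequested, reportFailure = reason == AttemptFailed)
  }
}
```

Passing the reason in keeps failure reporting (the logging and {{TaskSetFailed}} side-effects mentioned above) separate from cancellation, so case (b) can cancel running tasks without anything being reported as failed.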