[ 
https://issues.apache.org/jira/browse/TEZ-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502133#comment-16502133
 ] 

Jonathan Eagles commented on TEZ-3950:
--------------------------------------

The race is present in LocalTaskSchedulerService. However, the race in 
DagAwareYarnTaskScheduler and YarnTaskSchedulerService is easier to lose 
because those services have no message queue and containerBeingReleased is 
called synchronously.

> Preempted task attempts intermittently marked as FAILED instead of KILLED
> -------------------------------------------------------------------------
>
>                 Key: TEZ-3950
>                 URL: https://issues.apache.org/jira/browse/TEZ-3950
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0
>            Reporter: Jonathan Eagles
>            Priority: Major
>         Attachments: TEZ-3950.fail.patch
>
>
> TestMockDAGAppMaster.testInternalPreemption intermittently fails with 
> expected:<KILLED> but was:<FAILED>
> The crux of the matter is that TaskSchedulerManager sends two events:
> - TaskScheduler#deallocatedContainer -> 
>   TaskSchedulerManager#containerBeingReleased -> sends 
>   AMContainerStopRequest -> TA_CONTAINER_TERMINATING
> - AMContainerEventCompleted -> TA_CONTAINER_TERMINATED_BY_SYSTEM
> For a task attempt to be killed correctly, the second message loop must 
> complete first. The first path is longer, so the second message loop almost 
> always completes first. When the first message loop completes first, the 
> task attempt is marked as FAILED instead of KILLED.
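
The ordering dependence described above can be illustrated with a small, 
self-contained sketch. This is not Tez code: the class PreemptionRaceSketch 
and its apply/run methods are hypothetical, and the "first terminal event 
wins" rule is an assumption drawn only from the description, not from the 
actual task attempt state machine. Only the two event names come from the 
issue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the race; not the real Tez state machine.
class PreemptionRaceSketch {
    enum Event { TA_CONTAINER_TERMINATING, TA_CONTAINER_TERMINATED_BY_SYSTEM }
    enum State { RUNNING, KILLED, FAILED }

    // Assumption: whichever event reaches a RUNNING attempt first decides
    // the terminal state; later events are ignored.
    static State apply(State current, Event event) {
        if (current != State.RUNNING) {
            return current; // already terminal, nothing changes
        }
        switch (event) {
            case TA_CONTAINER_TERMINATED_BY_SYSTEM:
                return State.KILLED;
            case TA_CONTAINER_TERMINATING:
                return State.FAILED;
            default:
                return current;
        }
    }

    // Delivers the events in the given order and returns the final state.
    static State run(List<Event> deliveryOrder) {
        State state = State.RUNNING;
        for (Event event : deliveryOrder) {
            state = apply(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Usual case: the shorter second path completes first.
        System.out.println(run(Arrays.asList(
                Event.TA_CONTAINER_TERMINATED_BY_SYSTEM,
                Event.TA_CONTAINER_TERMINATING)));
        // Losing the race: the longer first path completes first.
        System.out.println(run(Arrays.asList(
                Event.TA_CONTAINER_TERMINATING,
                Event.TA_CONTAINER_TERMINATED_BY_SYSTEM)));
    }
}
```

Under this model the same pair of events yields KILLED in the first order 
and FAILED in the second, which matches the intermittent 
expected:<KILLED> but was:<FAILED> result reported by the test.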



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
