[ 
https://issues.apache.org/jira/browse/TEZ-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568396#comment-14568396
 ] 

Jeff Zhang commented on TEZ-2509:
---------------------------------

Should we also stop assigning containers in DelayedContainerManager?
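
A rough sketch of the kind of guard I have in mind, not the actual YarnTaskSchedulerService code. The class and method names (ShutdownAwareAssigner, initiateShutdown, tryAssign) are hypothetical and only illustrate the idea: the delayed-container path checks a shutdown flag under the same lock that shutdown takes, so no assignment can start once shutdown has been signalled.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration only; these names do not exist in Tez.
 * Shows an assignment path that refuses work once shutdown has begun.
 */
public class ShutdownAwareAssigner {
    // Guarded by "this"; set when the AM receives its shutdown signal.
    private boolean shuttingDown = false;
    private final List<String> assignments = new ArrayList<>();

    /** Called from the shutdown path before containers are stopped. */
    public synchronized void initiateShutdown() {
        shuttingDown = true;
    }

    /**
     * Called from the delayed-container loop.
     * Returns true if the container was assigned, false if the
     * assignment was skipped because shutdown is in progress.
     */
    public synchronized boolean tryAssign(String containerId, String attemptId) {
        if (shuttingDown) {
            return false; // do not hand a held container to a new attempt
        }
        assignments.add(containerId + " -> " + attemptId);
        return true;
    }

    /** Returns a copy of the assignments made so far. */
    public synchronized List<String> getAssignments() {
        return new ArrayList<>(assignments);
    }
}
```

Because both methods synchronize on the same object, an assignment either completes before initiateShutdown() runs or is rejected after it. Whether the real scheduler can take a lock at that point is something the patch would need to confirm.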



> YarnTaskSchedulerService should not try to allocate containers if AM is 
> shutting down
> -------------------------------------------------------------------------------------
>
>                 Key: TEZ-2509
>                 URL: https://issues.apache.org/jira/browse/TEZ-2509
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>         Attachments: TEZ-2509.1.patch, TEZ-2509.2.patch
>
>
> Observed when doing some recovery testing: 
> The task failed because 4 attempts of the same task failed during DAG shutdown. 
> {code}
> 2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: 
> vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, 
> startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, 
> status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, 
> info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid 
> transition - C_NM_STOP_SENT at RUNNING]
> TaskAttempt 1 failed, info=[AttemptId: 
> attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: 
> container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state]
> TaskAttempt 2 failed, info=[Container 
> container_e02_1433141118424_0012_01_000012 hit an invalid transition - 
> C_NM_STOP_SENT at RUNNING]
> TaskAttempt 3 failed, info=[Container 
> container_e02_1433141118424_0012_01_000025 hit an invalid transition - 
> C_NM_STOP_SENT at RUNNING], counters=Counters: 0
> {code}
>   
> DAG kill signal received.
> {code}
> 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: 
> DAGAppMasterShutdownHook invoked
> 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster 
> received a signal. Signaling TaskScheduler
> {code}
> First attempt marked as failed as container was killed.
> {code}
> 2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=initialmap, 
> taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, 
> startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, 
> status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container 
> container_e02_1433141118424_0012_01_000018 hit an invalid transition - 
> C_NM_STOP_SENT at RUNNING, counters=Counters: 0
> {code}
> Subsequent attempt scheduled, assigned and eventually fails. 
> {code}
> 2015-06-01 07:38:26,919 INFO [DelayedContainerManager] 
> rm.YarnTaskSchedulerService: Assigning container to task, 
> container=Container: [ContainerId: 
> container_e02_1433141118424_0012_01_000011, NodeId: 
> ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: 
> ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, 
> Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 
> }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, 
> localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, 
> honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, 
> containerResourceMemory=1536, containerResourceVCores=1
> {code}
> Scheduler stops too late.
> {code}
> 2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
