[ 
https://issues.apache.org/jira/browse/TEZ-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001214#comment-14001214
 ] 

Bikas Saha commented on TEZ-1122:
---------------------------------

One thing to consider here is de-linking what needs to be a blocking operation 
for the task from also being blocking at the AM (as well as the container). 
When the task calls canCommit() it ends up making the child JVM make a blocking 
RPC call and that RPC call results in the AM making a set of blocking calls 
into the state machine, which by itself is event driven.

> Race between canCommit and Task moving into RUNNING state
> ---------------------------------------------------------
>
>                 Key: TEZ-1122
>                 URL: https://issues.apache.org/jira/browse/TEZ-1122
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Siddharth Seth
>            Priority: Critical
>
> A task moves into RUNNING state via async events generated after a 
> TaskAttempt moves into RUNNING state, which is triggered by getTask().
> canCommit() is a synchronous call on the umbilical - for short running tasks, 
> a canCommit can come in before the async events are handled.
> {code}
> 2014-05-15 13:21:15,531 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: TaskAttempt: 
> [attempt_1400183444139_0007_1_00_000000_0] started. Is using containerId: 
> [container_1400183444139_0007_01_000002] on NM: []
> 2014-05-15 13:21:15,533 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, 
> startTime=1400185273335, containerId=container_1400183444139_0007_01_000002, 
> nodeId=, 
> inProgressLogs=/node/containerlogs/container_1400183444139_0007_01_000002/, 
> completedLogs=localhost:19888/jobhistory/logs///container_1400183444139_0007_01_000002/v_datagen_attempt_1400183444139_0007_1_00_000000_0/
> 2014-05-15 13:21:15,534 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: 
> attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from 
> START_WAIT to RUNNING due to event TA_STARTED_REMOTELY
> 2014-05-15 13:21:15,534 INFO [IPC Server handler 6 on 61779] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: Task not running. Issuing kill to 
> bad commit attempt attempt_1400183444139_0007_1_00_000000_0
> 2014-05-15 13:21:15,534 INFO [AMRM Callback Handler Thread] 
> org.apache.tez.dag.app.rm.TaskScheduler: App total resource memory: 0 cpu: -1 
> taskAllocations: 1
> 2014-05-15 13:21:15,537 INFO [AsyncDispatcher event handler] 
> org.apache.tez.common.counters.Limits: Counter limits initialized with 
> parameters:  GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64, 
> MAX_COUNTERS=1200
> 2014-05-15 13:21:15,541 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: task_1400183444139_0007_1_00_000000 
> Task Transitioned from SCHEDULED to RUNNING
> 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, 
> startTime=1400185273335, finishTime=1400185275542, timeTaken=2207, 
> status=KILLED, diagnostics=, counters=Counters: 0
> 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: 
> attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from 
> RUNNING to KILL_IN_PROGRESS due to event TA_KILL_REQUEST
> 2014-05-15 13:21:15,546 INFO [TaskSchedulerEventHandlerThread] 
> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Processing the event 
> EventType: S_TA_ENDED
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to