[ https://issues.apache.org/jira/browse/TEZ-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001214#comment-14001214 ]
Bikas Saha commented on TEZ-1122: --------------------------------- One thing to consider here is de-linking what needs to be a blocking operation for the task from also being blocking at the AM (as well as the container). When the task calls canCommit() it ends up making the child JVM make a blocking RPC call and that RPC call results in the AM making a set of blocking calls into the state machine, which by itself is event driven. > Race between canCommit and Task moving into RUNNING state > --------------------------------------------------------- > > Key: TEZ-1122 > URL: https://issues.apache.org/jira/browse/TEZ-1122 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.4.0 > Reporter: Siddharth Seth > Priority: Critical > > A task moves into RUNNING state via async events generated after a > TaskAttempt moves into RUNNING state, which is triggered by getTask(). > canCommit() is a synchronous call on the umbilical - for short running tasks, > a canCommit can come in before the async events are handled. > {code} > 2014-05-15 13:21:15,531 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: TaskAttempt: > [attempt_1400183444139_0007_1_00_000000_0] started. Is using containerId: > [container_1400183444139_0007_01_000002] on NM: [] > 2014-05-15 13:21:15,533 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.history.HistoryEventHandler: > [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_STARTED]: > vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, > startTime=1400185273335, containerId=container_1400183444139_0007_01_000002, > nodeId=, > inProgressLogs=/node/containerlogs/container_1400183444139_0007_01_000002/, > completedLogs=localhost:19888/jobhistory/logs///container_1400183444139_0007_01_000002/v_datagen_attempt_1400183444139_0007_1_00_000000_0/ > 2014-05-15 13:21:15,534 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: > attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from > START_WAIT to RUNNING due to event TA_STARTED_REMOTELY > 2014-05-15 13:21:15,534 INFO [IPC Server handler 6 on 61779] > org.apache.tez.dag.app.dag.impl.TaskImpl: Task not running. Issuing kill to > bad commit attempt attempt_1400183444139_0007_1_00_000000_0 > 2014-05-15 13:21:15,534 INFO [AMRM Callback Handler Thread] > org.apache.tez.dag.app.rm.TaskScheduler: App total resource memory: 0 cpu: -1 > taskAllocations: 1 > 2014-05-15 13:21:15,537 INFO [AsyncDispatcher event handler] > org.apache.tez.common.counters.Limits: Counter limits initialized with > parameters: GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64, > MAX_COUNTERS=1200 > 2014-05-15 13:21:15,541 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.app.dag.impl.TaskImpl: task_1400183444139_0007_1_00_000000 > Task Transitioned from SCHEDULED to RUNNING > 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.history.HistoryEventHandler: > [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_FINISHED]: > vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, > startTime=1400185273335, finishTime=1400185275542, timeTaken=2207, > status=KILLED, diagnostics=, counters=Counters: 0 > 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] > org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: > attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from > RUNNING to KILL_IN_PROGRESS due to event TA_KILL_REQUEST > 2014-05-15 13:21:15,546 INFO [TaskSchedulerEventHandlerThread] > org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Processing the event > EventType: S_TA_ENDED > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)