[ 
https://issues.apache.org/jira/browse/TEZ-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525517#comment-14525517
 ] 

Bikas Saha commented on TEZ-2379:
---------------------------------

- kill other attempts when one attempt succeeds
If both attempts succeed at the same time, then the task will choose the first 
one that sends it a succeeded event as the successful attempt. The status of 
the other attempt is irrelevant and it can ignore the subsequent kill request.
- kill transition via killUnfinishedAttempt
This is what happened in this jira. If the attempt has already succeeded by the 
time the task has asked it to fail then it should be ok to ignore that in the 
attempt.
- when a canCommit request is denied and then the attempt is killed
In this case, the attempt will not be in a completed state and will respond to 
the kill request in the running state.
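
A minimal sketch of the attempt-side behaviour argued for in the cases above, assuming a simplified attempt with three states. The class AttemptSketch and its methods are hypothetical and are not the actual Tez attempt implementation. A kill request is honoured only while the attempt is running; once the attempt is terminal the request is ignored and no second final notification is produced.

```java
// Hypothetical, simplified attempt model; not the real Tez attempt class.
class AttemptSketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    private State state = State.RUNNING;

    State state() {
        return state;
    }

    // The attempt finishes its work successfully.
    void succeed() {
        if (state == State.RUNNING) {
            state = State.SUCCEEDED;
        }
    }

    // Returns true only if this request caused a new terminal notification.
    boolean kill() {
        if (state != State.RUNNING) {
            // Already terminal: ignore the request and report nothing further.
            return false;
        }
        state = State.KILLED;
        return true;
    }
}
```

With this shape, an attempt that was denied canCommit is still RUNNING and answers the kill normally, while an attempt that already succeeded stays SUCCEEDED.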

It makes sense for a task to come out of a terminal state in case it needs to 
be re-run. But I don't think the same applies to an attempt. An attempt is an 
attempt. There is no re-incarnation for it.

bq. Kill wait should likely not get affected as kill_wait moves to killed only 
after all attempts are completed.
It is affected indirectly, as happened in this jira. Task kill-wait was 
counting the pending attempts to be completed. The successful completion event 
came, the count == number of attempts, and so the task moved from kill wait 
to killed. Hence when it got another killed attempt event, it barfed because it 
believed it had already received all attempt final state notifications.
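
The sequence above can be sketched as follows. This is a simplified model under stated assumptions, not the real TaskImpl: the class KillWaitSketch, its counter, and the event name T_ATTEMPT_SUCCEEDED as used here are illustrative. A single attempt reports success and then, after honouring a late kill request, also reports killed.

```java
// Hypothetical, simplified model of the counting problem; not the real TaskImpl.
class KillWaitSketch {
    enum TaskState { RUNNING, KILL_WAIT, KILLED }

    private TaskState state = TaskState.RUNNING;
    private final int totalAttempts;
    private int finishedAttempts = 0;

    KillWaitSketch(int totalAttempts) {
        this.totalAttempts = totalAttempts;
    }

    TaskState state() {
        return state;
    }

    // The task is asked to die and starts waiting for its attempts.
    void kill() {
        if (state == TaskState.RUNNING) {
            state = TaskState.KILL_WAIT;
        }
    }

    // Called once per terminal attempt notification, succeeded or killed.
    void onAttemptTerminalEvent(String event) {
        if (state == TaskState.KILLED) {
            // Mirrors the failure mode: no transition is defined for this event here.
            throw new IllegalStateException("Invalid event: " + event + " at KILLED");
        }
        finishedAttempts++;
        if (state == TaskState.KILL_WAIT && finishedAttempts == totalAttempts) {
            state = TaskState.KILLED;
        }
    }

    // Replays the scenario: success arrives first, then a killed event from the same attempt.
    static boolean secondEventIsRejected() {
        KillWaitSketch task = new KillWaitSketch(1);
        task.kill();
        task.onAttemptTerminalEvent("T_ATTEMPT_SUCCEEDED"); // count reaches total, task is KILLED
        try {
            task.onAttemptTerminalEvent("T_ATTEMPT_KILLED"); // second final event, same attempt
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

In this model the error is only a symptom: the attempt sent two final-state notifications, and swallowing the second one in the task would leave that untouched.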

So it seems to me that ignoring the attempt killed event is going to mask the 
real issue.


> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> T_ATTEMPT_KILLED at KILLED
> ------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2379
>                 URL: https://issues.apache.org/jira/browse/TEZ-2379
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Hitesh Shah
>            Priority: Blocker
>         Attachments: TEZ-2379.1.patch
>
>
> {noformat}
> 2015-04-28 04:49:32,455 ERROR [Dispatcher thread: Central] impl.TaskImpl: 
> Can't handle this event at current state for 
> task_1429683757595_0479_1_03_000013
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> T_ATTEMPT_KILLED at KILLED
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>         at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:853)
>         at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:106)
>         at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1874)
>         at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1860)
>         at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
>         at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Additional notes:
> ============
> Hive - latest build 
> Tez - master
> tpch-200 gb scale q_17 (kill the job in the middle of execution)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)