[ https://issues.apache.org/jira/browse/TEZ-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518337#comment-14518337 ]
Bikas Saha commented on TEZ-2379:
---------------------------------

1) Client issued a DAG kill that caused all tasks to get killed.
2) Task sent a kill request to its attempt and started waiting for the attempt to finish.
3) Attempt succeeded and sent done.
4) Task got attempt success and went into the killed state because all its attempts were done.
5) Attempt got the kill request. It honored that kill request in TerminatedAfterSuccessTransition and sent killed back to the task.
6) Task got attempt killed while in the killed state, and that is not handled.

From what I see in the code, 5 seems to be the problem here. The attempt should ignore a kill request if it is already done. An attempt is killed when a different attempt is successful and this attempt is not needed, or when the task is killed. Task retroactive kill, in which a successful task is killed (say, in order to run it again after node failure), does not use this flow. So unless we can think of any other use cases for a successful attempt transitioning to killed, we should ignore the kill request in the attempt if the attempt has already succeeded.
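The sequence in steps 3 to 6 can be replayed with a small standalone sketch. This is not Tez code: the classes Task and Attempt, the flag ignoreKillAfterSuccess, and the method replay are hypothetical names made up for illustration, and the real TaskImpl and attempt state machines have many more states and events. It only assumes, as described above, that a task in KILLED has no handler for T_ATTEMPT_KILLED.

```java
public class KillAfterSuccessSketch {
    enum AttemptState { RUNNING, SUCCEEDED, KILLED }
    enum TaskState { KILL_WAIT, KILLED }
    enum TaskEvent { T_ATTEMPT_SUCCEEDED, T_ATTEMPT_KILLED }

    // Hypothetical stand-in for the task: it is already waiting on a kill
    // (step 2) and rejects any attempt event once it has reached KILLED.
    static final class Task {
        TaskState state = TaskState.KILL_WAIT;

        void handle(TaskEvent event) {
            if (state == TaskState.KILLED) {
                throw new IllegalStateException("Invalid event: " + event + " at " + state);
            }
            // In KILL_WAIT, the last attempt finishing (either way) ends the task.
            state = TaskState.KILLED;
        }
    }

    // Hypothetical stand-in for the attempt. The flag switches between the
    // behavior described in step 5 and the proposed fix.
    static final class Attempt {
        AttemptState state = AttemptState.RUNNING;
        final Task task;
        final boolean ignoreKillAfterSuccess;

        Attempt(Task task, boolean ignoreKillAfterSuccess) {
            this.task = task;
            this.ignoreKillAfterSuccess = ignoreKillAfterSuccess;
        }

        void succeed() {
            state = AttemptState.SUCCEEDED;
            task.handle(TaskEvent.T_ATTEMPT_SUCCEEDED);
        }

        void kill() {
            if (state == AttemptState.SUCCEEDED && ignoreKillAfterSuccess) {
                return; // proposed fix: a finished attempt drops the late kill request
            }
            state = AttemptState.KILLED;
            task.handle(TaskEvent.T_ATTEMPT_KILLED);
        }
    }

    // Replays steps 3 to 6: the attempt succeeds first, then the kill request arrives.
    static String replay(boolean ignoreKillAfterSuccess) {
        Task task = new Task();
        Attempt attempt = new Attempt(task, ignoreKillAfterSuccess);
        try {
            attempt.succeed();
            attempt.kill();
            return "task=" + task.state + ", attempt=" + attempt.state;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // Invalid event: T_ATTEMPT_KILLED at KILLED
        System.out.println(replay(true));  // task=KILLED, attempt=SUCCEEDED
    }
}
```

With the flag off, the late kill produces the same kind of invalid-transition failure reported in this issue. With the flag on, the attempt stays SUCCEEDED and the task never sees a second event.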
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_KILLED at KILLED
> ------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2379
>                 URL: https://issues.apache.org/jira/browse/TEZ-2379
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Blocker
>
> {noformat}
> 2015-04-28 04:49:32,455 ERROR [Dispatcher thread: Central] impl.TaskImpl: Can't handle this event at current state for task_1429683757595_0479_1_03_000013
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_KILLED at KILLED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:853)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:106)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1874)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1860)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Additional notes:
> ============
> Hive - latest build
> Tez - master
> tpch-200 gb scale q_17 (kill the job in the middle of execution)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)