[
https://issues.apache.org/jira/browse/TEZ-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518337#comment-14518337
]
Bikas Saha commented on TEZ-2379:
---------------------------------
1) Client issued a DAG kill that caused all tasks to be killed.
2) Task sent kill request to its attempt and started waiting for the attempt
to finish
3) Attempt succeeded - sent done
4) Task got attempt success and went into killed state because all its attempts
are done
5) Attempt got kill request - it honored that kill request in
TerminatedAfterSuccessTransition and sent killed back to task.
6) Task got attempt killed in killed state and that is not handled.
From what I see in the code, step 5 seems to be the problem here. The attempt
should ignore a kill request if it's already done. An attempt is killed when a
different attempt is successful and this attempt is not needed, or when the
task is killed. Task retroactive kill, in which a successful task is killed
(say in order to run it again after node failure), does not use this flow. So
unless we can think of any other use cases for a successful attempt
transitioning to killed, we should ignore the kill request in the attempt if
the attempt has already succeeded.
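
A minimal sketch of the sequence above and of the proposed change. It does not
use the real Tez classes: KillAfterSuccessSketch, Task, Attempt, the state
enums, and the ignoreKillAfterSuccess flag are all hypothetical names made up
for illustration. With the flag off, the attempt honors a kill after success
and the task sees an event it cannot handle; with the flag on, the late kill
request is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model; NOT the actual Tez TaskImpl or
// TaskAttemptImpl state machines.
class KillAfterSuccessSketch {
    enum TaskState { RUNNING, KILL_WAIT, KILLED }
    enum AttemptState { RUNNING, SUCCEEDED, KILLED }

    static class Task {
        TaskState state = TaskState.RUNNING;
        // Events that arrived in a state with no transition registered.
        final List<String> unhandled = new ArrayList<>();

        void onKill() {
            if (state == TaskState.RUNNING) state = TaskState.KILL_WAIT;
        }

        void onAttemptSucceeded() {
            // Step 4: all attempts are done, so the waiting task becomes KILLED.
            if (state == TaskState.KILL_WAIT) state = TaskState.KILLED;
        }

        void onAttemptKilled() {
            if (state == TaskState.KILL_WAIT) {
                state = TaskState.KILLED;
            } else if (state == TaskState.KILLED) {
                // Step 6: nothing handles this event in the KILLED state.
                unhandled.add("T_ATTEMPT_KILLED at KILLED");
            }
        }
    }

    static class Attempt {
        AttemptState state = AttemptState.RUNNING;
        final Task task;
        final boolean ignoreKillAfterSuccess; // assumed switch for the proposed fix

        Attempt(Task task, boolean ignoreKillAfterSuccess) {
            this.task = task;
            this.ignoreKillAfterSuccess = ignoreKillAfterSuccess;
        }

        void onDone() {
            state = AttemptState.SUCCEEDED;
            task.onAttemptSucceeded();
        }

        void onKillRequest() {
            // Proposed behaviour: a succeeded attempt drops the kill request.
            if (state == AttemptState.SUCCEEDED && ignoreKillAfterSuccess) return;
            state = AttemptState.KILLED;
            task.onAttemptKilled();
        }
    }

    // Replays steps 1-6 and returns the events the task could not handle.
    static List<String> replay(boolean ignoreKillAfterSuccess) {
        Task task = new Task();
        Attempt attempt = new Attempt(task, ignoreKillAfterSuccess);
        task.onKill();           // steps 1-2: task waits for its attempt
        attempt.onDone();        // steps 3-4: attempt succeeds first
        attempt.onKillRequest(); // steps 5-6: kill request arrives late
        return task.unhandled;
    }

    public static void main(String[] args) {
        System.out.println("kill honored after success: " + replay(false));
        System.out.println("kill ignored after success: " + replay(true));
    }
}
```

The alternative would be to add a transition on the task side for an attempt
kill arriving in the killed state; the sketch only models the attempt-side
option discussed above.
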
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> T_ATTEMPT_KILLED at KILLED
> ------------------------------------------------------------------------------------------------------
>
> Key: TEZ-2379
> URL: https://issues.apache.org/jira/browse/TEZ-2379
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Blocker
>
> {noformat}
> 2015-04-28 04:49:32,455 ERROR [Dispatcher thread: Central] impl.TaskImpl: Can't handle this event at current state for task_1429683757595_0479_1_03_000013
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_KILLED at KILLED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:853)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:106)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1874)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1860)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Additional notes:
> ============
> Hive - latest build
> Tez - master
> tpch-200 gb scale q_17 (kill the job in the middle of execution)