[
https://issues.apache.org/jira/browse/MAPREDUCE-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107813#comment-13107813
]
Vinod Kumar Vavilapalli commented on MAPREDUCE-3031:
----------------------------------------------------
This is a bug in the NM: just about any container that is killed like this
(by doing a kill $pid on the node) will be stuck in the RUNNING state on the RM.
I found this on the corresponding NM:
{code}
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_KILLED_ON_REQUEST at RUNNING
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:297)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:39)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:439)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:685)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:69)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:356)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:349)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
    at java.lang.Thread.run(Thread.java:619)
{code}
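This is because an exit code of 137/143 is treated as a kill request, so the NM
raises CONTAINER_KILLED_ON_REQUEST while the container is still in RUNNING,
and that transition is not valid. In hindsight this turns out to be a bad idea;
we should fix it.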
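
To illustrate why the exit code alone is ambiguous, here is a minimal sketch. It
is not the NM code: the class, enum, and method names are hypothetical, and only
CONTAINER_KILLED_ON_REQUEST comes from the trace above. It relies on the usual
shell convention that a process terminated by signal N reports exit code 128 + N,
so 137 corresponds to SIGKILL (9) and 143 to SIGTERM (15). The second classifier
shows one possible direction, assuming the caller can tell whether a kill was
actually requested; it is not necessarily the fix that will be made.

```java
import java.util.OptionalInt;

class ExitCodeSketch {
    // Hypothetical event names; only CONTAINER_KILLED_ON_REQUEST appears in the trace.
    enum Event { CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_EXITED_WITH_FAILURE, CONTAINER_KILLED_ON_REQUEST }

    // Shell convention: exit code 128 + N means "terminated by signal N".
    static OptionalInt signalFor(int exitCode) {
        return (exitCode > 128 && exitCode < 256)
                ? OptionalInt.of(exitCode - 128)
                : OptionalInt.empty();
    }

    // The problematic approach: 137/143 always counts as a requested kill,
    // even when someone ran "kill $pid" by hand on the node.
    static Event classifyByExitCodeOnly(int exitCode) {
        if (exitCode == 0) {
            return Event.CONTAINER_EXITED_WITH_SUCCESS;
        }
        if (exitCode == 137 || exitCode == 143) {
            return Event.CONTAINER_KILLED_ON_REQUEST;
        }
        return Event.CONTAINER_EXITED_WITH_FAILURE;
    }

    // Assumed alternative: only report a requested kill if one was actually issued.
    static Event classifyWithIntent(int exitCode, boolean killRequested) {
        if (exitCode == 0) {
            return Event.CONTAINER_EXITED_WITH_SUCCESS;
        }
        return killRequested
                ? Event.CONTAINER_KILLED_ON_REQUEST
                : Event.CONTAINER_EXITED_WITH_FAILURE;
    }

    public static void main(String[] args) {
        int exitCode = 143; // what a plain "kill $pid" (SIGTERM) typically produces
        System.out.println("signal: " + signalFor(exitCode).getAsInt());
        System.out.println("exit code only: " + classifyByExitCodeOnly(exitCode));
        System.out.println("with intent:    " + classifyWithIntent(exitCode, false));
    }
}
```

With the exit-code-only rule, an external kill of a RUNNING container produces
an event that the RUNNING state does not accept, which matches the exception
above. With the intent-aware rule the same exit code is reported as a failure.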
> Job Client goes into infinite loop when we kill AM
> --------------------------------------------------
>
> Key: MAPREDUCE-3031
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3031
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.0
> Reporter: Karam Singh
> Fix For: 0.23.0
>
>
> Started a cluster. Submitted a sleep job with around 10000 maps and 1000
> reduces.
> Killed the AM with kill -9, by which time 7000 maps had already
> completed.
> On the RM web UI, the application is stuck in the Application.RUNNING state,
> and JobClient goes into an infinite loop because the RM keeps telling the
> client that the application is running.