[
https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507595#comment-14507595
]
Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------
The RM log shows the two map containers being allocated, container 3
terminating, then container 4 being allocated. All of this seems normal with
the map task failing and the AM requesting a new container. However, this is
the interesting part of the RM log:
{noformat}
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428390739155_23973_01_000004 Container Transitioned from ALLOCATED to KILLED
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1428390739155_23973_01_000004 in state: KILLED event:KILL
{noformat}
Note that the container was allocated yet killed before it was ACQUIRED. That
means the container was never received by the AM. That's why the AM was
confused about receiving the completed container -- it had never seen the
container allocated in the first place. So the next question: is there
anything in the RM log indicating why the container transitioned from ALLOCATED
to KILLED? Was it preempted or...?
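One way to look for other containers with the same pattern is to scan the RM log for state transitions. The following is only a rough sketch: it assumes each transition is logged on a single line in the format shown above, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches RM log lines of the form shown above, e.g.
#   ... RMContainerImpl: container_..._000004 Container Transitioned from ALLOCATED to KILLED
TRANSITION = re.compile(
    r"(container_\S+) Container Transitioned from (\w+) to (\w+)"
)


def transitions_by_container(lines):
    """Collect (from_state, to_state) pairs per container id, in log order."""
    seen = defaultdict(list)
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            container_id, src, dst = match.groups()
            seen[container_id].append((src, dst))
    return dict(seen)


def killed_before_acquired(lines):
    """Return ids of containers that went straight from ALLOCATED to KILLED."""
    return sorted(
        container_id
        for container_id, steps in transitions_by_container(lines).items()
        if ("ALLOCATED", "KILLED") in steps
    )


# The transition line quoted from the RM log above.
sample = [
    "2015-04-20,21:36:38,633 INFO "
    "org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: "
    "container_1428390739155_23973_01_000004 Container Transitioned from "
    "ALLOCATED to KILLED",
]
print(killed_before_acquired(sample))
```

This only finds the affected containers. It does not say why they were killed, so the surrounding RM log lines still need to be read by hand.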
This seems like a bug in YARN. The RM is telling the AM that a container
completed when it never told the AM about that container. The completion info doesn't tell the
AM enough to know, in the general case, which of its requests this could
correspond to and therefore which one it would need to re-request if it still
needs it. If a container is killed before it is ACQUIRED then the RM should
not treat the corresponding ask for that container as being fulfilled.
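To make the proposed behavior concrete, here is a toy model of the ask accounting. It is not the YARN scheduler code: `ToyScheduler`, `pending_asks`, and the method names are all hypothetical and only illustrate the rule that a kill before ACQUIRED should restore the ask.

```python
from enum import Enum


class State(Enum):
    ALLOCATED = 1
    ACQUIRED = 2
    KILLED = 3


class ToyScheduler:
    """Hypothetical model (not YARN code) of outstanding asks for one app."""

    def __init__(self, asks):
        self.pending_asks = asks  # container requests not yet satisfied
        self.containers = {}      # container id -> State

    def allocate(self, container_id):
        # Allocating a container tentatively consumes one ask.
        if self.pending_asks <= 0:
            raise ValueError("no outstanding ask to allocate against")
        self.pending_asks -= 1
        self.containers[container_id] = State.ALLOCATED

    def acquire(self, container_id):
        # The AM has now received the container from the RM.
        self.containers[container_id] = State.ACQUIRED

    def kill(self, container_id):
        # If the AM never received the container, the ask was never really
        # fulfilled, so put it back instead of leaving the AM to re-request.
        if self.containers[container_id] is State.ALLOCATED:
            self.pending_asks += 1
        self.containers[container_id] = State.KILLED


scheduler = ToyScheduler(asks=1)
scheduler.allocate("container_4")
scheduler.kill("container_4")   # killed before ACQUIRED
print(scheduler.pending_asks)   # the ask is outstanding again
```

With this rule the AM does not need to map the unexpected completion back to one of its requests, because the RM still considers the ask open.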
> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
> Key: MAPREDUCE-6329
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Attachments: syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM's start of a container on the NM failed.
> The job then hung.
> AM logs are attached.