[jira] [Commented] (MAPREDUCE-6329) Failure of start map task on NM cause job hang

Jason Lowe (JIRA) Wed, 22 Apr 2015 11:24:55 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507595#comment-14507595
 ]


Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------

The RM log shows the two map containers being allocated, container 3 
terminating, then container 4 being allocated.  All of this seems normal with 
the map task failing and the AM requesting a new container.  However, this is 
the interesting part in the RM log:
{noformat}
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428390739155_23973_01_000004 Container Transitioned from ALLOCATED to KILLED
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1428390739155_23973_01_000004 in state: KILLED event:KILL
{noformat}

Note that the container was allocated yet killed before it was ACQUIRED.  That 
means the container was never received by the AM.  That's why the AM was 
confused about receiving the completed container -- it had never seen the 
container allocated in the first place.  So the next question: is there 
anything in the RM log indicating why the container transitioned from ALLOCATED 
to KILLED?  Was it preempted or...?
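
For anyone who wants to check their own RM log for the same pattern, a rough sketch of a scan is below.  It assumes each RMContainerImpl transition message sits on a single line in the format quoted above; the class and method names are made up for illustration and are not part of YARN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a YARN class.
class ContainerTransitionScan {
    // Matches the "Container Transitioned from X to Y" message quoted above.
    private static final Pattern TRANSITION = Pattern.compile(
        "(container_\\S+) Container Transitioned from (\\w+) to (\\w+)");

    /** Returns ids of containers that went straight from ALLOCATED to KILLED. */
    static List<String> killedBeforeAcquired(List<String> logLines) {
        List<String> result = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = TRANSITION.matcher(line);
            if (!m.find()) {
                continue; // not a container transition line
            }
            // ALLOCATED -> KILLED means the AM never picked the container up.
            if (m.group(2).equals("ALLOCATED") && m.group(3).equals("KILLED")) {
                result.add(m.group(1));
            }
        }
        return result;
    }
}
```

Any container id this returns was killed without ever reaching ACQUIRED, so the lines just before its transition are where to look for the cause.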

This seems like a bug in YARN.  The RM is telling the AM a container completed 
that it never told the AM about before.  The completion info doesn't tell the 
AM enough to know, in the general case, which of its requests this could 
correspond to and therefore which one it would need to re-request if it still 
needs it.  If a container is killed before it is ACQUIRED then the RM should 
not treat the corresponding ask for that container as being fulfilled.
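
To make the proposed rule concrete, here is a toy model of the accounting.  This is only a sketch of the idea, not the scheduler's actual code: AskLedger and its methods are hypothetical names, and it collapses the real resource requests into a single counter.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed behavior; not YARN code.
class AskLedger {
    enum State { ALLOCATED, ACQUIRED }

    private int pendingAsks;
    private final Map<String, State> containers = new HashMap<>();

    AskLedger(int asks) {
        this.pendingAsks = asks;
    }

    /** The scheduler assigns a container against one outstanding ask. */
    void allocate(String containerId) {
        if (pendingAsks == 0) {
            throw new IllegalStateException("no outstanding ask");
        }
        pendingAsks--;
        containers.put(containerId, State.ALLOCATED);
    }

    /** The AM has pulled the container in an allocate response. */
    void acquire(String containerId) {
        containers.computeIfPresent(containerId, (id, s) -> State.ACQUIRED);
    }

    /** The container is killed; restore the ask if the AM never saw it. */
    void kill(String containerId) {
        State state = containers.remove(containerId);
        if (state == State.ALLOCATED) {
            pendingAsks++; // the ask was never really fulfilled
        }
    }

    int pendingAsks() {
        return pendingAsks;
    }
}
```

With this rule, a container killed while still ALLOCATED puts its ask back, so the scheduler would allocate a replacement without the AM having to guess which request to re-send.  A container killed after ACQUIRED leaves the count alone, since the AM knows about it and can re-request on its own.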

> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
>                 Key: MAPREDUCE-6329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>         Attachments: syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM, 
> and then the job hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
