[ https://issues.apache.org/jira/browse/MAPREDUCE-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596356#comment-14596356 ]
Jason Lowe commented on MAPREDUCE-6409:
---------------------------------------
I think it's a little harsh to treat NMNotYetReadyException as a failure to
launch without any retries. We don't do this for connection refused or socket
connection timeout, yet this is effectively an application-level connection
refusal. I agree with what Vinod mentioned earlier -- we should simply retry
on the exception. Can we have NMProxy set up the proxy to retry on
NMNotYetReadyException? In most cases the retries will eventually succeed
before it times out, and that's preferable to throwing away the container and
needing to allocate a new one.
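
To make the suggestion concrete, here is a minimal, self-contained sketch of
retrying on a "not yet ready" error until a time limit is reached. It does not
use Hadoop's actual retry classes: NotYetReadyException, retryWhileNotReady,
and the timing values are hypothetical stand-ins chosen for illustration, and
the real change would presumably live in the retry policy that NMProxy builds.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Hypothetical stand-in for YARN's NMNotYetReadyException. */
    static class NotYetReadyException extends Exception {
        NotYetReadyException(String msg) { super(msg); }
    }

    /**
     * Calls op until it succeeds. Only NotYetReadyException is retried;
     * any other exception propagates immediately. Sleeps sleepMs between
     * tries and gives up once another sleep would exceed maxWaitMs.
     */
    static <T> T retryWhileNotReady(Callable<T> op, long maxWaitMs, long sleepMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (true) {
            try {
                return op.call();
            } catch (NotYetReadyException e) {
                if (System.currentTimeMillis() + sleepMs > deadline) {
                    throw e; // out of time: surface the failure to the caller
                }
                Thread.sleep(sleepMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated NM that rejects the first two launch requests,
        // as if it had not yet registered with the RM.
        String result = retryWhileNotReady(() -> {
            if (++calls[0] < 3) {
                throw new NotYetReadyException("NM not yet registered with RM");
            }
            return "container launched";
        }, 1000, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the caller sees a failure only if the NM stays unready for the
whole wait period, so a brief NM restart costs a short delay rather than the
container.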
> NM restarts could lead to app failures
> --------------------------------------
>
> Key: MAPREDUCE-6409
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6409
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Karthik Kambatla
> Assignee: Robert Kanter
> Priority: Critical
> Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch the container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure. A
> few of these could lead to a task/job failure.