[jira] [Commented] (MAPREDUCE-6409) NM restarts could lead to app failures

Jason Lowe (JIRA) Mon, 22 Jun 2015 11:18:56 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596356#comment-14596356
 ]


Jason Lowe commented on MAPREDUCE-6409:
---------------------------------------

I think it's a little harsh to treat NMNotYetReadyException as a failure to 
launch without any retries.  We don't do this for connection refused or socket 
connection timeout, yet this is effectively an application-level connection 
refusal.  I agree with what Vinod mentioned earlier -- we should simply retry 
on the exception.  Can we have NMProxy set up the proxy to retry on 
NMNotYetReadyException?  In most cases the retries will eventually succeed 
before the proxy times out, and that's preferable to throwing away the 
container and needing to allocate a new one.
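
To illustrate the retry behavior suggested above, here is a minimal, 
self-contained sketch.  It is not the NMProxy code and uses no Hadoop 
classes: NotYetReadyException is a hypothetical stand-in for 
NMNotYetReadyException, and callWithRetry, maxRetries, and sleepMillis are 
names assumed for the example.  The idea is just that the "not yet ready" 
case is retried with a fixed sleep, and only counts as a launch failure once 
the retries are exhausted.

```java
import java.util.concurrent.Callable;

final class NmRetrySketch {

    /** Hypothetical stand-in for NMNotYetReadyException; not the Hadoop class. */
    static final class NotYetReadyException extends Exception {
        NotYetReadyException(String message) {
            super(message);
        }
    }

    /**
     * Invokes the call, retrying only when it throws NotYetReadyException.
     * Up to maxRetries additional attempts are made, with a fixed sleep
     * between attempts. Any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> call, int maxRetries, long sleepMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return call.call();
            } catch (NotYetReadyException e) {
                if (retries >= maxRetries) {
                    // Retries exhausted: only now surface this as a failure.
                    throw e;
                }
                retries++;
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated launch: the first two attempts hit a node that is not ready.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NotYetReadyException("NM not yet registered with RM");
            }
            return "container launched";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the container allocation is kept across the failed attempts, 
which is the benefit being argued for: the caller sees either a successful 
launch or a single failure after the retry budget runs out.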


> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: MAPREDUCE-6409
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6409
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Robert Kanter
>            Priority: Critical
>         Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch
>
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch the container on node N.
> Step 3 could lead to an NMNotYetReadyException, depending on whether NM N 
> has registered with the RM. In MR, this is considered a task attempt 
> failure. A few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
