[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Vinod Kumar Vavilapalli (JIRA) Fri, 30 Oct 2015 14:58:13 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983421#comment-14983421
 ]


Vinod Kumar Vavilapalli commented on MAPREDUCE-6513:
----------------------------------------------------

Went through the discussion. Here's what we should do, mostly agreeing with 
what [~chen317] says.
 - Node failure should not be counted towards the task-attempt count. So, yes, 
let's continue to mark such tasks as killed.
 - Rescheduling of this killed task can (and must) take higher priority, 
independent of whether it is marked as killed or failed. In fact, this was how 
we originally designed the failed-map-should-have-higher-priority concept. In 
spirit, fail-fast-map actually meant maps which retroactively failed, like in 
this case.
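
To make the two points above concrete, here is a rough sketch of the intended 
behaviour. It is not the actual AM code: the class, the enum, and the numeric 
priority values are all hypothetical, and a lower number is simply assumed to 
mean "served first".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: none of these names are Hadoop's actual classes.
class RescheduleSketch {
    // Assumed convention for this sketch: lower number = served first.
    static final int PRIORITY_RESCHEDULED_MAP = 5;
    static final int PRIORITY_REDUCE = 10;
    static final int PRIORITY_MAP = 20;

    // How a map attempt ended (hypothetical states).
    enum EndState { FAILED, KILLED_BY_NODE_LOSS }

    static final class Task {
        final String id;
        int failedAttempts = 0; // only genuine failures are counted here
        Task(String id) { this.id = id; }
    }

    static final class Request {
        final String taskId;
        final int priority;
        Request(String taskId, int priority) {
            this.taskId = taskId;
            this.priority = priority;
        }
    }

    // Pending container requests, ordered by priority value.
    private final PriorityQueue<Request> pending =
        new PriorityQueue<>(Comparator.comparingInt((Request r) -> r.priority));

    void requestMap(Task t) {
        pending.add(new Request(t.id, PRIORITY_MAP));
    }

    void requestReduce(String reduceId) {
        pending.add(new Request(reduceId, PRIORITY_REDUCE));
    }

    // Point 1: a node loss marks the attempt killed and does not bump the
    // failure count. Point 2: the retry is requested at the elevated priority
    // whether the attempt was killed or failed.
    void onAttemptEnded(Task t, EndState state) {
        if (state == EndState.FAILED) {
            t.failedAttempts++;
        }
        pending.add(new Request(t.id, PRIORITY_RESCHEDULED_MAP));
    }

    // Returns task ids in the order this sketch would ask for containers.
    List<String> drain() {
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            order.add(pending.poll().taskId);
        }
        return order;
    }

    public static void main(String[] args) {
        RescheduleSketch s = new RescheduleSketch();
        Task lost = new Task("map-1");
        s.requestReduce("reduce-1");
        s.requestMap(new Task("map-2"));
        s.onAttemptEnded(lost, EndState.KILLED_BY_NODE_LOSS);
        System.out.println(s.drain() + " failedAttempts=" + lost.failedAttempts);
    }
}
```

The only point of the sketch is that the rescheduled map goes ahead of the 
reducers, so reducers are not left waiting on a map that never gets a 
container.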

[~varun_saxena], I can take a stab at this if you don't have cycles. Let me 
know either way.

In any case, this has been a long-standing problem (though I'm very surprised 
nobody caught this till now), so I'd propose we move this out into 2.7.3 so I 
can make progress on the 2.7.2 release. Thoughts? /cc [~Jobo]

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> When a job with many tasks was in progress, one node became unstable due to 
> an OS issue. After the node became unstable, the maps on this node changed 
> to the KILLED state. 
> Currently, maps which were running on the unstable node are rescheduled; all 
> are in the scheduled state, waiting for the RM to assign containers. Ask 
> requests for maps were seen until the node was good again (all of those 
> failed); there are no ask requests after this. But the AM keeps on preempting 
> the reducers (it's recycling).
> Finally, reducers are waiting for mappers to complete, and mappers didn't get 
> containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
