[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Varun Saxena (JIRA) Fri, 16 Oct 2015 23:01:30 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961762#comment-14961762 ]


Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

Yes, Sunil, we need to update the ask to tell the RM that it need not allocate 
containers for these reducers. This is what I described in one of my comments yesterday.
In short, in this JIRA I intend to take a two-pronged approach to resolve it:
1. Update the ask to tell the RM that it need not allocate containers for ramped-down 
reducers (those ramped down in the preemptReducesIfNeeded() method). We are 
currently testing this change.
2. Introduce a config, or reuse the MAPREDUCE-6302 config, to detect hanging map 
requests, and do not ramp up reducers while mappers are starved. I have not yet looked 
at the post-MAPREDUCE-6302 code, but this is the basic idea.
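The two prongs can be illustrated with a toy model of the AM's reducer bookkeeping. This is a minimal sketch under stated assumptions, not the actual RMContainerAllocator change: the class ReducerRampModel, its fields and methods, and the hanging_map_timeout_ms setting are all hypothetical, and the timeout merely stands in for whatever MAPREDUCE-6302-style config the fix ends up using. ramp_down shrinks the reducer ask so the RM stops allocating for ramped-down reducers (prong 1), and ramp_up refuses to ask for reducers while any map request has been outstanding longer than the timeout (prong 2).

```python
from typing import Dict


class ReducerRampModel:
    """Toy model of reducer ramp-up/ramp-down in an MR AM (hypothetical, not Hadoop code)."""

    def __init__(self, hanging_map_timeout_ms: int) -> None:
        # Hypothetical setting: how long a map request may stay unsatisfied
        # before the maps are treated as starved (prong 2).
        self.hanging_map_timeout_ms = hanging_map_timeout_ms
        self.pending_reduces = 0  # reducers held back in the AM, not asked for
        self.reduce_ask = 0       # reducer containers currently asked of the RM
        self.map_ask_since: Dict[str, int] = {}  # map id -> time of first request (ms)

    def add_reduces(self, n: int) -> None:
        self.pending_reduces += n

    def map_requested(self, map_id: str, now_ms: int) -> None:
        # Keep the earliest request time so re-sent asks do not reset the clock.
        self.map_ask_since.setdefault(map_id, now_ms)

    def map_assigned(self, map_id: str) -> None:
        self.map_ask_since.pop(map_id, None)

    def maps_starved(self, now_ms: int) -> bool:
        # Prong 2: a map request outstanding for at least the timeout counts as hanging.
        return any(now_ms - since >= self.hanging_map_timeout_ms
                   for since in self.map_ask_since.values())

    def ramp_up(self, n: int, now_ms: int) -> int:
        # Prong 2: do not ask for more reducers while mappers are starved.
        if self.maps_starved(now_ms):
            return 0
        moved = min(n, self.pending_reduces)
        self.pending_reduces -= moved
        self.reduce_ask += moved
        return moved

    def ramp_down(self, n: int) -> int:
        # Prong 1: besides moving reducers back to pending, shrink the ask so
        # the RM stops allocating containers for the ramped-down reducers.
        moved = min(n, self.reduce_ask)
        self.reduce_ask -= moved
        self.pending_reduces += moved
        return moved
```

For example, with a 60 s timeout, 10 reducers, and one map request outstanding since t = 0: ramp_up(4, now_ms=0) raises the ask to 4, ramp_down(4) brings it back to 0, and ramp_up(4, now_ms=120_000) returns 0 until map_assigned is called for that map. Tracking the earliest request time per map, rather than one global timestamp, is just one way to decide that a map is hanging; the real patch may choose differently.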

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with many tasks was in progress, one node became unstable 
> due to an OS issue. After the node became unstable, the status of the maps 
> on this node changed to KILLED. 
> The maps that were running on the unstable node were rescheduled; all of them 
> are in the SCHEDULED state, waiting for the RM to assign containers. Ask requests 
> for the maps were seen until the node recovered (all of those failed), but 
> there were no ask requests after that. Meanwhile the AM keeps preempting the 
> reducers in a cycle.
> In the end, the reducers are waiting for the mappers to complete, and the 
> mappers never get containers.
> My question is:
> ============
> Why did the AM not send map requests after the node recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
