[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Sunil G (JIRA) Fri, 16 Oct 2015 23:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961765#comment-14961765
 ]


Sunil G commented on MAPREDUCE-6513:
------------------------------------

Hi [~varun_saxena],
I feel point 1 can be tracked separately, as it may introduce more
complexity. Here is an example.

Suppose the AM initially places a request for 10 reducers at timeframe1. In
the next AM heartbeat, we try to reset this count to 5 because of the new
issues we found. However, the RM could already have allocated some containers
against the request placed earlier.

So in the new heartbeat the AM sends an updated ask for 5 reducers at
timeframe1, and the response may carry containers that the RM newly allocated
for the previously placed request. The AM then has to reject them or send an
updated count in the next heartbeat, and this may go on.

The AM can reject the extra reducer containers, but a lot of rejections may
occur in these corner cases, so we need to be careful here.
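
To make the corner case concrete, here is a rough sketch of the race as a toy
simulation. It is not Hadoop code: ToyResourceManager, simulate, and the
per-interval capacity are hypothetical names and parameters introduced only
for illustration. It assumes that each ask carries an absolute count that
overwrites what the RM has pending, and that the RM hands over allocated
containers on the following heartbeat.

```java
public class ReducerAskRaceSketch {

    /** Toy stand-in for the RM (hypothetical, not a YARN class). */
    static final class ToyResourceManager {
        private int pending; // containers still owed according to the latest ask
        private int ready;   // containers allocated but not yet delivered to the AM

        /** AM heartbeat: the ask overwrites the pending count; ready containers are returned. */
        int heartbeat(int ask) {
            pending = ask;
            int delivered = ready;
            ready = 0;
            return delivered;
        }

        /** Scheduling that happens between two AM heartbeats. */
        void schedule(int capacity) {
            int allocated = Math.min(capacity, pending);
            pending -= allocated;
            ready += allocated;
        }
    }

    /**
     * Runs the given number of heartbeats. The AM wants initialAsk reducers on the
     * first heartbeat and loweredAsk reducers afterwards.
     * Returns {containers kept, containers rejected}.
     */
    public static int[] simulate(int initialAsk, int loweredAsk,
                                 int capacityPerInterval, int heartbeats) {
        ToyResourceManager rm = new ToyResourceManager();
        int held = 0;
        int rejected = 0;
        for (int hb = 1; hb <= heartbeats; hb++) {
            int target = (hb == 1) ? initialAsk : loweredAsk;
            int stillNeeded = Math.max(0, target - held);
            // The AM cannot see containers that are allocated but not yet delivered,
            // so its ask may be stale by the time the RM applies it.
            int delivered = rm.heartbeat(stillNeeded);
            int kept = Math.min(delivered, stillNeeded);
            held += kept;
            rejected += delivered - kept; // excess containers the AM has to give back
            rm.schedule(capacityPerInterval);
        }
        return new int[] {held, rejected};
    }

    public static void main(String[] args) {
        int[] result = simulate(10, 5, 4, 6);
        System.out.println("kept=" + result[0] + " rejected=" + result[1]);
        // prints "kept=5 rejected=4"
    }
}
```

With an initial ask of 10, a lowered ask of 5, and up to 4 allocations between
heartbeats, this toy AM keeps 5 containers and gives back 4, spread over two
consecutive heartbeats. That is the "it may go on" behaviour described above.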

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with many tasks was in progress, one node became unstable due to
> an OS issue. After the node became unstable, the maps on this node changed
> to the KILLED state.
> The maps that were running on the unstable node are rescheduled; all of them
> are in the scheduled state, waiting for the RM to assign containers. Ask
> requests for maps were seen till the node was good (all of those failed);
> there are no ask requests after that. But the AM keeps preempting the
> reducers (it keeps recycling them).
> In the end, the reducers are waiting for the mappers to complete, and the
> mappers never get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
