[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Rohith Sharma K S (JIRA) Sat, 31 Oct 2015 10:01:22 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984049#comment-14984049
 ]


Rohith Sharma K S commented on MAPREDUCE-6513:
----------------------------------------------

I don't think Release 2.7.2 needs to be held for this issue, since it 
appears very rarely and is very hard to reproduce. If a solution is ready, 
agreed upon, and available, then it is fine to include it in 2.7.2. I am fine 
either way.

Coming back to the issue discussion, 
bq. Rescheduling of this killed task can (and must) take higher priority 
independent of whether it is marked as killed or failed
This is the best way to solve it. It also covers another scenario that can 
cause a similar problem, i.e. when completed or running tasks are killed using 
the MR client. While trying to reproduce the current issue, I killed completed 
tasks using the MR client. For 3-4 iterations, ramping up happened much as in 
this issue, but at some point the calculations went from abnormal back to 
normal.
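
To make the priority argument concrete, here is a toy sketch (not the actual 
allocator code) of how the order of container requests changes when a 
rescheduled killed map is given a higher priority than reducers. The numeric 
priority values and the helper names allocation_order and reschedule_priority 
are assumptions made for illustration only; a lower number is served first.

```python
import heapq

# Assumed priority values, for illustration only (lower number = served first).
PRIORITY_FAST_FAIL_MAP = 5
PRIORITY_REDUCE = 10
PRIORITY_MAP = 20


def allocation_order(requests):
    """Return request names in the order a priority-based allocator serves them.

    `requests` is a list of (name, priority) pairs. Ties are broken by
    arrival order, which is tracked with a sequence number.
    """
    heap = [(prio, seq, name) for seq, (name, prio) in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


def reschedule_priority(raise_for_killed):
    """Hypothetical policy switch: priority given to a rescheduled killed map."""
    return PRIORITY_FAST_FAIL_MAP if raise_for_killed else PRIORITY_MAP


pending_reducers = [("reduce-0", PRIORITY_REDUCE), ("reduce-1", PRIORITY_REDUCE)]

# Killed map rescheduled at normal map priority: reducers are served first.
before = allocation_order(
    pending_reducers + [("killed-map", reschedule_priority(False))]
)

# Killed map rescheduled at the higher priority: it is served before reducers.
after = allocation_order(
    pending_reducers + [("killed-map", reschedule_priority(True))]
)

print(before)
print(after)
```

In this toy model the killed map is served last in the first case and first in 
the second, which is the behaviour the quoted proposal is aiming for.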

One *challenge is regression*. Even though increasing the priority solves 
the hang in one way, I am wondering whether configuring the slow start value 
to different values could still cause a hang, i.e. going into a loop. Any 
thoughts?

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with many tasks was in progress, one node became unstable 
> due to an OS issue. After the node became unstable, the maps on this node 
> changed to the KILLED state. 
> The maps that were running on the unstable node were rescheduled; all of 
> them are in the scheduled state, waiting for the RM to assign containers. Ask 
> requests for maps were seen until the node was good again (all of those 
> failed); there were no ask requests after that. But the AM keeps preempting 
> the reducers (it keeps recycling them).
> In the end, the reducers are waiting for the mappers to complete and the 
> mappers never get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
