[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Varun Saxena (JIRA) Sat, 31 Oct 2015 04:10:43 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983942#comment-14983942
 ]


Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

Thanks [~vinodkv] for your input.

During an offline chat with [~vvasudev], [~sunilg] and [~rohithsharma] 
yesterday, this JIRA came up for discussion, and we too were in general 
agreement with [~cchen1257] on the question of why we should mix up 
rescheduling at a higher priority with task failure. If a node becomes 
unusable, then since its maps had already completed, they should be taken up 
again immediately, and setting a higher priority achieves that. We can still 
avoid marking this as a failed attempt.
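
To make the idea above concrete, here is a minimal sketch in Python. It is not 
actual MapReduce AM code: the Task and Scheduler classes, the method names and 
the priority values are all hypothetical, chosen only to show that a map can be 
re-requested at a higher priority without its failed-attempt count changing.

```python
import heapq
import itertools
from dataclasses import dataclass

# Hypothetical priority levels for illustration; a lower number is served first.
PRIORITY_RESCHEDULED_MAP = 5
PRIORITY_REDUCE = 10
PRIORITY_MAP = 20


@dataclass
class Task:
    task_id: str
    failed_attempts: int = 0


class Scheduler:
    """Toy container-request queue ordered by priority, then arrival order."""

    def __init__(self):
        self._queue = []
        # Unique, increasing tie-breaker so equal priorities stay FIFO
        # and Task objects are never compared with each other.
        self._counter = itertools.count()

    def request(self, task, priority):
        heapq.heappush(self._queue, (priority, next(self._counter), task))

    def on_node_unusable(self, completed_maps):
        # Maps that completed on the lost node must run again.
        # They are re-requested at the higher priority, and
        # failed_attempts is deliberately left untouched.
        for task in completed_maps:
            self.request(task, PRIORITY_RESCHEDULED_MAP)

    def next_task(self):
        priority, _, task = heapq.heappop(self._queue)
        return task.task_id, priority
```

With a pending map and a pending reduce already queued, a map rescheduled via 
on_node_unusable() is handed out first, and its failed_attempts stays at 0.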

I was in fact about to raise a JIRA to handle that separately, to get 
attention to this issue.
But based on your comment on MAPREDUCE-6514, let's move what I was planning to 
do here over to there, so that we can discuss it further. If required, one 
more JIRA can be raised.

And we can adopt the approach here.
I think I will get cycles for this as this issue came from our customer.

Also, I think there is no need to hold up 2.7.2 for this, and we can move it 
to 2.7.3. [~Jobo] should be OK with this as well, as he is on my team. If 
required, i.e. if we decide not to use 2.7.3 or 2.7.3 is late, I will merge 
this into our internal branch.

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> When a job with many tasks was in progress, one node became unstable 
> due to some OS issue. After the node became unstable, the map on this node 
> changed to KILLED state. 
> Currently, maps which were running on the unstable node are rescheduled, and 
> all are in scheduled state, waiting for the RM to assign containers. Ask 
> requests for maps were seen until the node was good (all of those failed); 
> there are no ask requests after this. But the AM keeps on preempting the 
> reducers (it's recycling).
> Finally, reducers are waiting for mappers to complete, and mappers didn't get 
> containers.
> My question is:
> ============
> Why did the AM not send map requests once the node recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
