[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

chong chen (JIRA) Mon, 19 Oct 2015 10:44:34 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963672#comment-14963672
 ]


chong chen commented on MAPREDUCE-6513:
---------------------------------------

Varun, thanks for your detailed analysis. I do have a question, though.

Looking at the flow, if the mapper tasks failed because of a node problem, why 
did the MapReduce application master not treat this as a map task failure? If 
it had, the current logic would reset the map priority to 
PRIORITY_FAST_FAIL_MAP instead of PRIORITY_MAP, so by design the maps would 
have higher priority than the reducers, and the problem you describe would no 
longer occur. Is there any particular reason why the failed map task was not 
recognized as failed?
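
To make the ordering concrete, here is a minimal, self-contained sketch. It 
is not the actual RMContainerAllocator code. The constant names 
PRIORITY_FAST_FAIL_MAP and PRIORITY_MAP come from the discussion above; 
PRIORITY_REDUCE, the numeric values 5/10/20, and the helpers mapPriority and 
serveOrder are assumptions made for illustration. It also assumes that a 
lower number means a higher priority.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PrioritySketch {
    // Assumed values for illustration only; a lower number is served first.
    static final int PRIORITY_FAST_FAIL_MAP = 5;
    static final int PRIORITY_REDUCE = 10;
    static final int PRIORITY_MAP = 20;

    // Hypothetical helper: priority the AM would request for a rescheduled map,
    // depending on whether the earlier attempt was recorded as a failure.
    static int mapPriority(boolean recordedAsFailed) {
        return recordedAsFailed ? PRIORITY_FAST_FAIL_MAP : PRIORITY_MAP;
    }

    // Hypothetical helper: order in which outstanding asks would be served
    // if the scheduler strictly followed ascending priority numbers.
    static List<String> serveOrder(boolean mapRecordedAsFailed) {
        TreeMap<Integer, String> asks = new TreeMap<>();
        asks.put(PRIORITY_REDUCE, "reduce");
        asks.put(mapPriority(mapRecordedAsFailed), "rescheduled map");
        return new ArrayList<>(asks.values());
    }

    public static void main(String[] args) {
        // Recorded as failed: the map ask sorts ahead of the reducers.
        System.out.println(serveOrder(true));
        // Not recorded as failed: the map ask sorts behind the reducers.
        System.out.println(serveOrder(false));
    }
}
```

Under these assumptions, a rescheduled map that is recorded as failed is 
served before any reducer, while one that keeps PRIORITY_MAP is served after 
them.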

Of course, the current YARN RM/AM protocol is not a strictly delta-based 
protocol, so the parties' views of the state can become inconsistent, which 
causes many race conditions. Redesigning the protocol is not easy; for now, 
what we can do is fix these problems one by one. So I agree with logging an 
issue in 6514 to track this individual case.

Chong

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with a large number of tasks was in progress, one node became 
> unstable due to an OS issue. After the node became unstable, the status of 
> the maps on this node changed to KILLED. 
> The maps that were running on the unstable node have been rescheduled; all 
> of them are in the scheduled state, waiting for the RM to assign containers. 
> Ask requests for maps were seen until the node became good again (all of 
> those failed); there are no ask requests after that. But the AM keeps 
> preempting the reducers (it keeps recycling them).
> In the end, the reducers are waiting for the mappers to complete, and the 
> mappers never get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
