[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Varun Saxena (JIRA) Sun, 01 Nov 2015 04:32:58 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984385#comment-14984385
 ]


Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

[~vinodkv], attaching an initial patch. Kindly review.

This patch primarily does the following:
# When an unusable node is reported, task attempt kill events are sent for 
completed and running map tasks which ran on the node. A flag has been added to 
this event to indicate whether the next task attempt will be rescheduled 
(scheduled with a higher priority of 5). For an unusable node, the attempt is 
marked to be rescheduled. If a task attempt is killed by the client, it will 
not be rescheduled with higher priority. I am not 100% convinced that a 
user-initiated kill should lead to a higher priority. Your thoughts on this?
# This rescheduled flag is then forwarded to TaskImpl in the attempt 
killed event after killing of the attempt is complete.
# Based on this flag, the task will then create a new attempt and send a 
TA_RESCHEDULE or TA_SCHEDULE event on processing the attempt kill event. As it 
is a kill event, it is not counted towards failed attempts. If the attempt has 
to be rescheduled, TaskAttemptImpl will send a container request event to 
RMContainerAllocator. From here on, this will be treated like a failed map and 
hence the priority will be 5. As for failed maps, node or rack locality is not 
ensured. Node locality cannot be ensured anyway until the node comes up.
# As we only consider SUCCESSFUL tasks on recovery, I think we need not update 
this flag in the history file.
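
The flow in the list above can be sketched roughly as follows. This is a toy 
model only, not the patch and not the Hadoop classes: RescheduleSketch, 
KillSource, KillEvent, ContainerRequest and Task are hypothetical names, the 
normal map priority of 20 is an assumption made for illustration, and only the 
priority of 5 and the "kill is not a failure" rule come from the description 
above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reschedule flag flow. These are NOT the Hadoop classes. */
final class RescheduleSketch {

    // 5 is the priority named in the comment for rescheduled (failed-like) maps.
    static final int PRIORITY_RESCHEDULED_MAP = 5;
    // Assumed value for a normally scheduled map; chosen only for illustration.
    static final int PRIORITY_NORMAL_MAP = 20;

    /** Who asked for the kill (hypothetical enum). */
    enum KillSource { UNUSABLE_NODE, CLIENT }

    /** Stand-in for the attempt kill event that carries the new flag. */
    static final class KillEvent {
        final String attemptId;
        final boolean reschedule;

        KillEvent(String attemptId, KillSource source) {
            this.attemptId = attemptId;
            // Only kills caused by an unusable node are marked for rescheduling.
            this.reschedule = (source == KillSource.UNUSABLE_NODE);
        }
    }

    /** Stand-in for the container request sent to the allocator. */
    static final class ContainerRequest {
        final String attemptId;
        final int priority;
        final boolean localityRequested;

        ContainerRequest(String attemptId, int priority, boolean localityRequested) {
            this.attemptId = attemptId;
            this.priority = priority;
            this.localityRequested = localityRequested;
        }
    }

    /** Stand-in for the task that reacts to the attempt killed event. */
    static final class Task {
        private int attemptCounter = 1; // attempt_0 is the original attempt
        int failedAttempts = 0;
        final List<ContainerRequest> requests = new ArrayList<>();

        ContainerRequest onAttemptKilled(KillEvent event) {
            // A kill is not a failure, so failedAttempts is left untouched.
            String newAttemptId = "attempt_" + attemptCounter++;
            ContainerRequest request = event.reschedule
                // Rescheduled: treated like a failed map, no locality requested.
                ? new ContainerRequest(newAttemptId, PRIORITY_RESCHEDULED_MAP, false)
                // Plain schedule: normal priority, locality requested (assumed).
                : new ContainerRequest(newAttemptId, PRIORITY_NORMAL_MAP, true);
            requests.add(request);
            return request;
        }
    }
}
```

In this sketch, passing a KillEvent built with KillSource.UNUSABLE_NODE to 
Task.onAttemptKilled yields a request at priority 5 without locality, while a 
KillSource.CLIENT kill yields a request at the assumed normal priority. In both 
cases failedAttempts stays at 0.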


> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>         Attachments: MAPREDUCE-6513.01.patch
>
>
> While a job with many tasks was in progress, one node became unstable 
> due to an OS issue. After the node became unstable, the status of the maps 
> on this node changed to KILLED. 
> Currently, maps which were running on the unstable node are rescheduled, and 
> all are in the scheduled state, waiting for the RM to assign containers. Ask 
> requests for maps were seen till the node was good (all of those failed); 
> there are no ask requests after this. But the AM keeps on preempting the 
> reducers (it's recycling).
> Finally, reducers are waiting for mappers to complete and mappers didn't get 
> containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
