[
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984385#comment-14984385
]
Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------
[~vinodkv], attaching an initial patch. Kindly review.
This patch primarily does the following:
# When an unusable node is reported, task attempt kill events are sent for
completed and running map tasks which ran on the node. A flag has been added to
this event to indicate whether the next task attempt will be rescheduled
(scheduled with a higher priority of 5). For an unusable node, the flag is set
so that the attempt is rescheduled. If a task attempt is killed by the client,
it will not be rescheduled with higher priority. I am not 100% convinced that a
user-initiated kill should lead to a higher priority. Your thoughts on this?
# This rescheduled flag is then forwarded to TaskImpl in the attempt killed
event once killing of the attempt is complete.
# Based on this flag, the task will then create a new attempt and send a
TA_RESCHEDULE or TA_SCHEDULE event on processing the attempt killed event. As it
is a kill event, it is not counted towards failed attempts. If the attempt has
to be rescheduled, TaskAttemptImpl will send a container request event to
RMContainerAllocator. From here on, this will be treated like a failed map and
hence the priority will be 5. As for failed maps, node or rack locality is not
ensured. Node locality cannot be ensured anyway until the node comes up.
# As we only consider SUCCESSFUL tasks on recovery, I think we need not update
this flag in the history file.
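
For reviewers, the flow in the points above can be sketched roughly as follows. This is a simplified, self-contained model and not the code in the attached patch: the class RescheduleSketch, the AttemptKillEvent type, its rescheduleAttempt field and the helper methods are hypothetical names used only for illustration. The priority of 5 is the value mentioned above; the value 20 for ordinary map requests is an assumption.

```java
/** Hypothetical, simplified model of the described flow; not the actual patch. */
public class RescheduleSketch {
    // 5 is the priority named in the comment above; 20 for ordinary maps is an assumption.
    static final int PRIORITY_FAST_FAIL_MAP = 5;
    static final int PRIORITY_MAP = 20;

    /** The two events the task may send to the new attempt. */
    enum EventType { TA_SCHEDULE, TA_RESCHEDULE }

    /** Kill event carrying the new flag (hypothetical class and field names). */
    static final class AttemptKillEvent {
        final String attemptId;
        final boolean rescheduleAttempt;

        AttemptKillEvent(String attemptId, boolean rescheduleAttempt) {
            this.attemptId = attemptId;
            this.rescheduleAttempt = rescheduleAttempt;
        }
    }

    /** Unusable node: the kill event is marked so the next attempt is rescheduled. */
    static AttemptKillEvent onUnusableNode(String attemptId) {
        return new AttemptKillEvent(attemptId, true);
    }

    /** Client-initiated kill: the next attempt is scheduled normally. */
    static AttemptKillEvent onClientKill(String attemptId) {
        return new AttemptKillEvent(attemptId, false);
    }

    /** The task picks the event for the new attempt based on the forwarded flag. */
    static EventType nextAttemptEvent(AttemptKillEvent killed) {
        return killed.rescheduleAttempt ? EventType.TA_RESCHEDULE : EventType.TA_SCHEDULE;
    }

    /** A rescheduled attempt is requested like a failed map, i.e. with priority 5. */
    static int containerRequestPriority(EventType type) {
        return type == EventType.TA_RESCHEDULE ? PRIORITY_FAST_FAIL_MAP : PRIORITY_MAP;
    }

    public static void main(String[] args) {
        AttemptKillEvent nodeKill = onUnusableNode("attempt_m_000001_0");
        AttemptKillEvent userKill = onClientKill("attempt_m_000002_0");
        System.out.println(containerRequestPriority(nextAttemptEvent(nodeKill)));
        System.out.println(containerRequestPriority(nextAttemptEvent(userKill)));
    }
}
```

In this sketch only the kill raised for an unusable node takes the higher-priority path; a client kill keeps the ordinary map priority, which matches the open question in the first point.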
> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
> Key: MAPREDUCE-6513
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, resourcemanager
> Affects Versions: 2.7.0
> Reporter: Bob
> Assignee: Varun Saxena
> Priority: Critical
> Attachments: MAPREDUCE-6513.01.patch
>
>
> When a job with many tasks was in progress, one node became unstable
> due to an OS issue. After the node became unstable, the maps on this node
> changed to the KILLED state.
> Currently, maps which were running on the unstable node are rescheduled, and
> all are in the scheduled state waiting for the RM to assign containers. Ask
> requests for maps were seen until the node was good again (all of those
> failed); there are no ask requests after this. But the AM keeps on preempting
> the reducers (it's recycling).
> Finally, reducers are waiting for mappers to complete and mappers didn't get
> a container.
> My Question Is:
> ============
> Why were map requests not sent by the AM once the node recovered?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)