[jira] [Updated] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Vinod Kumar Vavilapalli (JIRA) Tue, 05 Apr 2016 19:08:29 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinod Kumar Vavilapalli updated MAPREDUCE-6513:
-----------------------------------------------
    Status: Open  (was: Patch Available)

Thanks for the update, [~varun_saxena]!

Apologies for missing your updated patch for this long!

(Reviewing an MR patch after a looong time!)

First up, the patch doesn't apply anymore. Can you please update it?

I tried to review it despite the conflicts. Some comments:
 - The logic looks good overall! You are right that a user-initiated kill should 
not lead to a higher priority.
 - We want to be sure that existing semantics in RMContainerAllocator about 
failed-maps are really about task-attempts that need to be rescheduled and not 
just failed-maps. I briefly looked, but it will be good for you to also 
reverify!
 - TestTaskAttempt.java
    -- Most (all?) of the code can be reused between testContainerKillOnNew and 
testContainerKillOnUnassigned.
    -- Also, in the existing tests, we should leave rescheduleAttempt as false 
except in the new one, testKillMapTaskAfterSuccess. You have enough coverage 
elsewhere that we should simply drop these changes except for the new tests.
 - TestMRApp.java.testUpdatedNodes: Instead of checking for reschedule events, 
is it possible to explicitly check for the higher priority of the corresponding 
request?
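
To make the first point above concrete, here is a minimal, self-contained sketch of the rule being discussed. All names and priority values in it (ReschedulePrioritySketch, KillCause, the two constants) are hypothetical and are not the actual MR classes or the patch's code; it only assumes that a lower number means a more urgent container request.

```java
public class ReschedulePrioritySketch {
    // Hypothetical priority values; a lower number is assumed to be more urgent.
    static final int PRIORITY_RESCHEDULED_MAP = 5;
    static final int PRIORITY_NORMAL_MAP = 20;

    // Hypothetical reasons for which a map attempt may be killed.
    enum KillCause { USER_REQUEST, NODE_UNUSABLE }

    // Only a kill caused by an unusable node marks the attempt for rescheduling.
    static boolean shouldReschedule(KillCause cause) {
        return cause == KillCause.NODE_UNUSABLE;
    }

    // A rescheduled attempt is requested at the higher priority;
    // a user-initiated kill keeps the normal map priority.
    static int priorityFor(KillCause cause) {
        return shouldReschedule(cause) ? PRIORITY_RESCHEDULED_MAP : PRIORITY_NORMAL_MAP;
    }

    public static void main(String[] args) {
        for (KillCause cause : KillCause.values()) {
            System.out.println(cause + " -> priority " + priorityFor(cause));
        }
    }
}
```

A test along the lines suggested for testUpdatedNodes would then assert on the value returned by something like priorityFor, rather than only on the presence of a reschedule event.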

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob.zhao
>            Assignee: Varun Saxena
>            Priority: Critical
>         Attachments: MAPREDUCE-6513.01.patch
>
>
> While a job with many tasks was in progress, one node became unstable 
> due to an OS issue. After the node became unstable, the maps on this node 
> changed to the KILLED state. 
> Currently, the maps that were running on the unstable node are rescheduled; all 
> of them are in the scheduled state, waiting for the RM to assign containers. Ask 
> requests for maps were seen until the node was good again (all of those failed); 
> there are no ask requests after this. But the AM keeps on preempting the reducers 
> (it's recycling).
> Finally, the reducers are waiting for the mappers to complete, and the mappers 
> didn't get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
