[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Rohith Sharma K S (JIRA) Fri, 16 Oct 2015 11:10:22 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961113#comment-14961113 ]


Rohith Sharma K S commented on MAPREDUCE-6513:
----------------------------------------------

[~varun_saxena] thanks for your detailed analysis.
From the logs you extracted in your previous comment, I see that reducers are
ramped up regardless of whether scheduledMaps is zero or greater than zero. I
think the code below should not blindly ramp up the reducers:
{code}
if (rampUp > 0) {
  rampUp = Math.min(rampUp, numPendingReduces);
  LOG.info("Ramping up " + rampUp);
  rampUpReduces(rampUp);
}
{code}

I think checking for {{scheduledMaps==0}} before ramping up should avoid the
issue regardless of mapper priority. But the question remains: what if the
scheduledMaps are failed map attempts? A better way to handle this is to check
the priority of every scheduled map. If the priority of all scheduledMaps is
less than that of the reducers, then ramping up can be done.
{code}
// If scheduledMaps is non-zero then, regardless of mapper priority,
// do not ramp up reducers.
if (rampUp > 0 && scheduledMaps == 0) {
  rampUp = Math.min(rampUp, numPendingReduces);
  LOG.info("Ramping up " + rampUp);
  rampUpReduces(rampUp);
}
{code}
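
To make the priority-based alternative concrete, here is a rough, self-contained sketch. It is not the actual allocator code: the class name, the method {{reducersToRampUp}}, and the integer "urgency" scale (a larger value is served first) are all assumptions made up for illustration, and the real priority ordering in the AM may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; none of these names come from the Hadoop code base. */
public class RampUpGuardSketch {

    /**
     * Returns how many reducers may be ramped up.
     * Assumption: a larger "urgency" value means the request is served first.
     * Ramp-up is allowed only when every scheduled map is less urgent than reducers.
     */
    public static int reducersToRampUp(int rampUp, int numPendingReduces,
                                       List<Integer> scheduledMapUrgencies,
                                       int reduceUrgency) {
        if (rampUp <= 0) {
            return 0;
        }
        for (int mapUrgency : scheduledMapUrgencies) {
            if (mapUrgency >= reduceUrgency) {
                // A scheduled map is at least as urgent as the reducers: hold back.
                return 0;
            }
        }
        // Same cap as in the original snippet.
        return Math.min(rampUp, numPendingReduces);
    }

    public static void main(String[] args) {
        int reduceUrgency = 2; // hypothetical value
        // No scheduled maps: ramp-up proceeds, capped by pending reducers.
        System.out.println(reducersToRampUp(5, 3, Collections.<Integer>emptyList(), reduceUrgency));
        // All scheduled maps are less urgent than reducers: ramp-up proceeds.
        System.out.println(reducersToRampUp(5, 3, Arrays.asList(1, 1), reduceUrgency));
        // One scheduled map is more urgent than reducers: no ramp-up.
        System.out.println(reducersToRampUp(5, 3, Arrays.asList(1, 3), reduceUrgency));
    }
}
```

Compared with the plain {{scheduledMaps == 0}} check, this variant still ramps up when the only scheduled maps rank below the reducers, at the cost of inspecting every scheduled request.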

Any thoughts?

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with a large number of tasks was in progress, one node became
> unstable due to an OS issue. After the node became unstable, the maps on
> this node changed to the KILLED state.
> The maps that were running on the unstable node have been rescheduled; all of
> them are in the scheduled state, waiting for the RM to assign containers. Ask
> requests for maps were seen until the node was good again (all of those
> failed), and there were no ask requests after that. But the AM keeps
> preempting the reducers (it keeps recycling them).
> In the end, the reducers are waiting for the mappers to complete, and the
> mappers never get a container.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
