Wangda Tan created MAPREDUCE-6689:
-------------------------------------

             Summary: MapReduce job can infinitely increase the number of reducer 
resource requests
                 Key: MAPREDUCE-6689
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Wangda Tan
            Assignee: Wangda Tan
            Priority: Blocker


We have seen this issue on one of our clusters: while running a terasort 
MapReduce job, some mappers failed after the reducers had started, and the MR AM 
then tried to preempt reducers in order to schedule the failed mappers.

After that, the MR AM enters an infinite loop. On every 
RMContainerAllocator#heartbeat run, it:

- Cancels all scheduled reducer requests in {{preemptReducesIfNeeded}} 
(total scheduled reducers = 1024).
- Then ramps up all reducers again in {{scheduleReduces}} (total = 1024).
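
The two steps above can be modeled with a small toy simulation. This is only a sketch, not the actual RMContainerAllocator code: the class, field, and method names are hypothetical, and it assumes (as the observed symptom suggests) that each cancel-then-re-request cycle adds to a cumulative request count instead of netting out to zero.

```java
// Toy model only. All names here are hypothetical and do NOT correspond to
// real fields or methods of RMContainerAllocator.
final class HeartbeatLoopSketch {
    private final int totalReducers;   // reducers the job wants (1024 in the report)
    private int scheduledReducers;     // reducer requests the AM considers scheduled
    private long totalRequested;       // cumulative requests sent (assumed never netted out)

    HeartbeatLoopSketch(int totalReducers) {
        this.totalReducers = totalReducers;
        this.scheduledReducers = totalReducers;
    }

    /** One heartbeat of the buggy cycle described in the report. */
    void heartbeat() {
        // Step 1 (stands in for preemptReducesIfNeeded): cancel every scheduled reducer request.
        scheduledReducers = 0;
        // Step 2 (stands in for scheduleReduces): ramp all reducers up again.
        scheduledReducers = totalReducers;
        // Assumption: the re-issued requests accumulate in the cumulative count.
        totalRequested += totalReducers;
    }

    int scheduledReducers() { return scheduledReducers; }

    long totalRequested() { return totalRequested; }

    public static void main(String[] args) {
        HeartbeatLoopSketch am = new HeartbeatLoopSketch(1024);
        for (int i = 1; i <= 5; i++) {
            am.heartbeat();
            System.out.println("heartbeat " + i
                    + ": scheduled=" + am.scheduledReducers()
                    + ", cumulative requested=" + am.totalRequested());
        }
    }
}
```

In this model the AM-side view never changes (1024 scheduled reducers after every heartbeat), while the cumulative count grows by 1024 each time, which matches the symptom reported here.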

As a result, the total #requested-containers increases by 1024 on every 
MRAM-RM heartbeat (1 sec per heartbeat). The AM hung for 18+ hours, so we 
ended up with 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
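
The estimate above can be checked with a few lines of Java. This is just a worked version of the arithmetic; the class and method names are made up for illustration, and it assumes exactly one heartbeat per second for the full 18 hours.

```java
// Worked arithmetic for the estimate in the report; names are illustrative only.
final class RequestBacklogEstimate {
    /** hours * 3600 seconds/hour * heartbeats/second * requests/heartbeat. */
    static long estimate(long hours, long heartbeatsPerSecond, long requestsPerHeartbeat) {
        long seconds = Math.multiplyExact(hours, 3600L);
        long heartbeats = Math.multiplyExact(seconds, heartbeatsPerSecond);
        return Math.multiplyExact(heartbeats, requestsPerHeartbeat);
    }

    public static void main(String[] args) {
        // 18 hours, 1 heartbeat per second, 1024 reducer requests per heartbeat.
        System.out.println(estimate(18, 1, 1024)); // prints 66355200
    }
}
```

Since the AM hung for more than 18 hours, 66,355,200 is a lower bound, consistent with the "66M+" figure.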

This bug also triggered YARN-4844, which made the RM stop scheduling anything.

Thanks to [~sidharta-s] for helping with analysis. 



