[ https://issues.apache.org/jira/browse/MAPREDUCE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272988#comment-15272988 ]
Wangda Tan commented on MAPREDUCE-6689:
---------------------------------------

Uploaded a patch for this (on top of MAPREDUCE-6514).

> MapReduce job can infinitely increase number of reducer resource requests
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6689
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: MAPREDUCE-6689.1.patch
>
>
> We have seen this issue on one of our clusters: while running a terasort
> MapReduce job, some mappers failed after the reducers had started, and the
> MR AM then tried to preempt reducers to schedule the failed mappers.
> After that, the MR AM entered an infinite loop. On every
> RMContainerAllocator#heartbeat run, it:
> - In {{preemptReducesIfNeeded}}, cancels all scheduled reducer requests
> (total scheduled reducers = 1024).
> - Then, in {{scheduleReduces}}, ramps up all reducers (total = 1024).
> As a result, the total #requested-containers increased by 1024 on every
> MR AM-RM heartbeat (1 sec per heartbeat). The AM hung for 18+ hours, so
> we got 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
> This bug also triggered YARN-4844, which makes the RM stop scheduling
> anything.
> Thanks to [~sidharta-s] for helping with the analysis.
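
The cancel-then-ramp-up cycle and the 66M figure from the description can be checked with a small stand-alone model. This is only a sketch under stated assumptions, not the actual RMContainerAllocator code: the class ReducerRequestGrowth, the Model type and its fields are hypothetical names, and the model simply takes the reported observation as given (cancelling in preemptReducesIfNeeded does not lower the RM-side total) without modelling why that happens.

```java
// Toy model of the heartbeat loop described in the report.
// All names here are hypothetical; this is not Hadoop code.
final class ReducerRequestGrowth {
    // Values taken from the report.
    static final int TOTAL_REDUCERS = 1024;
    static final int HEARTBEAT_INTERVAL_SECONDS = 1;

    static final class Model {
        int scheduledReducers = TOTAL_REDUCERS; // requests the AM considers outstanding
        int pendingReducers = 0;                // reducers waiting to be ramped up
        long requestedAtRm = TOTAL_REDUCERS;    // total requests recorded on the RM side

        // Cancels every scheduled reducer request on the AM side.
        // Assumption: the RM-side total is NOT reduced, matching the observed growth.
        void preemptReducesIfNeeded() {
            pendingReducers += scheduledReducers;
            scheduledReducers = 0;
        }

        // Ramps up all pending reducers again, adding fresh requests at the RM.
        void scheduleReduces() {
            requestedAtRm += pendingReducers;
            scheduledReducers += pendingReducers;
            pendingReducers = 0;
        }

        void heartbeat() {
            preemptReducesIfNeeded();
            scheduleReduces();
        }
    }

    // Number of extra RM-side requests accumulated after the given heartbeats.
    static long requestsAddedAfter(long heartbeats) {
        Model model = new Model();
        long before = model.requestedAtRm;
        for (long i = 0; i < heartbeats; i++) {
            model.heartbeat();
        }
        return model.requestedAtRm - before;
    }

    public static void main(String[] args) {
        long heartbeats = 18L * 3600 / HEARTBEAT_INTERVAL_SECONDS; // 18 hours
        System.out.println(requestsAddedAfter(heartbeats)); // prints 66355200
    }
}
```

Under these assumptions each heartbeat adds exactly 1024 requests, so 18 hours of 1-second heartbeats add 18 * 3600 * 1024 = 66,355,200, which matches the "66M+" estimate above.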