[ https://issues.apache.org/jira/browse/MAPREDUCE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274118#comment-15274118 ]

Jason Lowe commented on MAPREDUCE-6689:
---------------------------------------

Sorry for arriving late to the discussion.

bq. Should we really be ramping up if we have hanging map requests, irrespective 
of the configured reduce rampup limit?

Ramping up reducers when maps are hanging does sound a bit dubious, but it may 
make sense in some scenarios.  Consider a case where a job issues tons of maps, 
far more than the queue can handle.  Some of those maps are going to appear to 
be hanging for a very long time because they have to run in multiple waves.  
The whole point of ramping up reducers before the maps are complete is to try 
to reduce job latency (at the expense of overall cluster throughput) by 
pipelining the shuffle of the completed tasks with the remaining map tasks.  If 
the job has tons of data to shuffle for each map then it may make sense to 
sacrifice some of the map resources to get the reducers running early so they 
can start chewing on the horde of completed map output.  It all depends upon 
the map durations, the shuffle burden, etc.  It is definitely safer from a 
correctness point of view to avoid ramping up reducers if there are any hanging 
maps at all, but I believe there could be some jobs whose latency could 
increase as a result of that change.
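To make that trade-off concrete, here is a rough sketch of the kind of slow-start 
/ ramp-up decision involved. This is not the actual RMContainerAllocator code and 
the class and method names are made up for illustration; the config keys in the 
comments ({{mapreduce.job.reduce.slowstart.completedmaps}} and 
{{yarn.app.mapreduce.am.job.reduce.rampup.limit}}) are the real knobs that govern 
this behavior, with the usual Hadoop defaults noted.

{code:java}
/**
 * Simplified illustration of the reduce slow-start / ramp-up trade-off.
 * NOT the real RMContainerAllocator logic: once enough maps have finished,
 * dedicate a bounded share of the headroom to reducers so the shuffle of
 * completed map output overlaps with the remaining map waves.
 */
public class ReduceRampUpSketch {

  // mapreduce.job.reduce.slowstart.completedmaps (usual default 0.05)
  private final float slowStartThreshold;
  // yarn.app.mapreduce.am.job.reduce.rampup.limit (usual default 0.5)
  private final float rampUpLimit;

  public ReduceRampUpSketch(float slowStartThreshold, float rampUpLimit) {
    this.slowStartThreshold = slowStartThreshold;
    this.rampUpLimit = rampUpLimit;
  }

  /** How many reducers to have scheduled right now. */
  public int reducersToSchedule(int totalMaps, int completedMaps,
                                int totalReduces, int headroomContainers) {
    float completedFraction = totalMaps == 0 ? 1f
        : (float) completedMaps / totalMaps;
    if (completedFraction < slowStartThreshold) {
      return 0;  // too early: all resources stay with the maps
    }
    // Cap the reducers by the ramp-up limit so maps still get resources.
    int rampUpCap = (int) (rampUpLimit * headroomContainers);
    return Math.min(totalReduces, rampUpCap);
  }
}
{code}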

I'm guessing the root cause of the issue is an incorrect headroom report, e.g.: 
there's technically enough free space in the headroom but it's fragmented 
across nodes in such a way that no single map can fit on any node.  The 
unconditional preemption logic from MAPREDUCE-6302 was supposed to address 
this, but it looks like the container allocator can quickly "forget" this 
decision and re-schedule the reducers that were shot.
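For intuition, here is a toy illustration of how a purely aggregate headroom 
number can mislead the AM. The numbers and class name are hypothetical; in the 
real allocator the headroom is reported by the RM in the allocate response rather 
than computed per node like this.

{code:java}
/**
 * Illustration of fragmented headroom: cluster-wide free memory looks large
 * enough for a map container, but no single node can actually host it, so the
 * map appears to hang. Numbers are hypothetical.
 */
public class HeadroomFragmentationSketch {
  public static void main(String[] args) {
    int mapMemoryMb = 4096;                        // one map container needs 4 GB
    int[] freePerNodeMb = {1024, 1536, 2048, 512}; // free space scattered across nodes

    int aggregateHeadroomMb = 0;
    int largestSingleNodeMb = 0;
    for (int free : freePerNodeMb) {
      aggregateHeadroomMb += free;
      largestSingleNodeMb = Math.max(largestSingleNodeMb, free);
    }

    // The aggregate headroom says "5 GB free", so a map looks schedulable...
    System.out.println("aggregate headroom = " + aggregateHeadroomMb + " MB");
    // ...but no single node can actually run it.
    System.out.println("map fits on some node? "
        + (largestSingleNodeMb >= mapMemoryMb));
  }
}
{code}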

> MapReduce job can infinitely increase number of reducer resource requests
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6689
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: MAPREDUCE-6689.1.patch
>
>
> We have seen this issue on one of our clusters: when running a terasort 
> map-reduce job, some mappers failed after reducers had started, and the MR AM 
> then tried to preempt reducers to schedule the failed mappers.
> After that, the MR AM enters an infinite loop; on every 
> RMContainerAllocator#heartbeat run, it:
> - In {{preemptReducesIfNeeded}}, cancels all scheduled reducer requests 
> (total scheduled reducers = 1024).
> - Then, in {{scheduleReduces}}, ramps all reducers back up (total = 1024).
> As a result, the total #requested-containers increases by 1024 on every 
> MRAM-RM heartbeat (1 heartbeat per second). The AM hung for 18+ hours, so we 
> end up with 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
> This bug also triggered YARN-4844, which makes the RM stop scheduling 
> anything.
> Thanks to [~sidharta-s] for helping with the analysis. 
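For illustration only (not the actual allocator code; the class name is made up), 
the runaway growth described in the report boils down to this per-heartbeat 
pattern, which also reproduces the ~66M figure:

{code:java}
/**
 * Toy model of the runaway described above: each heartbeat the AM cancels its
 * scheduled reducer requests and then ramps the same reducers right back up,
 * so the RM-side request total only ever grows. Not real Hadoop code.
 */
public class RunawayReducerRequestsSketch {
  public static void main(String[] args) {
    final int scheduledReducers = 1024;  // reducers rescheduled per heartbeat
    final long heartbeats = 18L * 3600;  // one heartbeat per second for 18 hours

    long requestedContainersOnRm = 0;
    for (long hb = 0; hb < heartbeats; hb++) {
      // preemptReducesIfNeeded(): cancel all scheduled reducer requests
      // scheduleReduces(): ramp the same 1024 reducers right back up
      requestedContainersOnRm += scheduledReducers;  // RM total keeps climbing
    }
    // ~66 million requested containers, matching the report above
    System.out.println("requested containers on RM: " + requestedContainersOnRm);
  }
}
{code}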


