[
https://issues.apache.org/jira/browse/MAPREDUCE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274118#comment-15274118
]
Jason Lowe commented on MAPREDUCE-6689:
---------------------------------------
Sorry for arriving late to the discussion.
bq. Should we really be ramping up if we have hanging map requests, irrespective
of the configured value of the reduce rampup limit?
Ramping up reducers when maps are hanging does sound a bit dubious, but it may
make sense in some scenarios. Consider a case where a job issues tons of maps,
far more than the queue can handle. Some of those maps are going to appear to
be hanging for a very long time because they have to run in multiple waves.
The whole point of ramping up reducers before the maps are complete is to try
to reduce job latency (at the expense of overall cluster throughput) by
pipelining the shuffle of the completed tasks with the remaining map tasks. If
the job has tons of data to shuffle for each map then it may make sense to
sacrifice some of the map resources to get the reducers running early so they
can start chewing on the horde of completed map output. It all depends upon
the map durations, the shuffle burden, etc. It is definitely safer from a
correctness point of view to avoid ramping up reducers if there are any hanging
maps at all, but I believe some jobs' latency could increase as a result of that
change.
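For concreteness, here is a minimal sketch (not part of any patch on this issue) of how a latency-sensitive job could opt into that tradeoff, assuming the standard slowstart and ramp-up limit properties; the values are purely illustrative:
{code:java}
// Illustrative only: opt a job into early reducer ramp-up so completed map
// output can be shuffled while later map waves are still running.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EarlyShuffleJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start launching reducers after only 5% of maps have completed.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
    // Cap the share of the job's resources the early reducers may take,
    // so the remaining maps are not starved outright.
    conf.setFloat("yarn.app.mapreduce.am.job.reduce.rampup.limit", 0.5f);

    Job job = Job.getInstance(conf, "early-shuffle-job");
    // ... set mapper/reducer classes, input/output paths, etc. ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}
Whether that pays off depends on the same factors as above: map durations, shuffle volume per map, and how much of the queue the maps actually need.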
I'm guessing the root cause of the issue is an incorrect headroom report, e.g.:
there's technically enough free space in the headroom but it's fragmented
across nodes in such a way that no single map can fit on any node. The
unconditional preemption logic from MAPREDUCE-6302 was supposed to address
this, but it looks like the container allocator can quickly "forget" this
decision and re-schedule the reducers that were shot.
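To make that re-scheduling cycle concrete, here is a toy back-of-the-envelope simulation (emphatically not the real RMContainerAllocator logic) of what the description below reports: every heartbeat cancels the scheduled reducers and then ramps the same 1024 back up, so the asks accumulated on the RM side grow without bound:
{code:java}
// Toy model of the feedback loop: per heartbeat, preemptReducesIfNeeded cancels
// the scheduled reducer requests, then scheduleReduces re-adds them, producing
// another batch of container asks for the RM.
public class ReducerRampUpLoopSim {
  public static void main(String[] args) {
    final int scheduledReducers = 1024;   // reducers the job wants
    final int heartbeats = 18 * 3600;     // ~18 hours at one heartbeat per second
    long requestedAtRm = 0;               // container requests accumulated on the RM side

    for (int hb = 0; hb < heartbeats; hb++) {
      // cancel + re-ramp-up nets out to one more full batch of reducer asks
      requestedAtRm += scheduledReducers;
    }
    // Prints roughly 66 million, matching the count observed on the RM.
    System.out.println("requested containers seen by RM: " + requestedAtRm);
  }
}
{code}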
> MapReduce job can infinitely increase number of reducer resource requests
> -------------------------------------------------------------------------
>
> Key: MAPREDUCE-6689
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Blocker
> Attachments: MAPREDUCE-6689.1.patch
>
>
> We have seen this issue on one of our clusters: when running a terasort
> MapReduce job, some mappers failed after the reducers had started, and the MR
> AM then tried to preempt reducers to schedule these failed mappers.
> After that, the MR AM enters an infinite loop; on every
> RMContainerAllocator#heartbeat run, it:
> - In {{preemptReducesIfNeeded}}, it cancels all scheduled reducer requests.
> (total scheduled reducers = 1024)
> - Then, in {{scheduleReduces}}, it ramps up all reducers (total = 1024).
> As a result, the total number of requested containers increases by 1024 on
> every MRAM-RM heartbeat (one heartbeat per second). The AM hung for 18+ hours,
> so we end up with 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
> This bug also triggered YARN-4844, which makes the RM stop scheduling
> anything.
> Thanks to [~sidharta-s] for helping with analysis.