[
https://issues.apache.org/jira/browse/MAPREDUCE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274416#comment-15274416
]
Varun Saxena commented on MAPREDUCE-6689:
-----------------------------------------
Agree.
If we do introduce a config to decide whether maps have been starved or not (and
hence not ramp up reducers), it will have to be tuned per job: not only by the
type of job, but also by the size of the data it processes in each run and
several other factors.
I do see that it will be almost impossible to decide on an accurate value
for such a config.
We have had the fix from MAPREDUCE-6514 in our private branch for several
months, but we do not yet have MAPREDUCE-6302 in.
Let us see how the recent fixes, along with MAPREDUCE-6302, behave on a real
cluster. I think they should cover most of the scenarios.
> MapReduce job can infinitely increase number of reducer resource requests
> -------------------------------------------------------------------------
>
> Key: MAPREDUCE-6689
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Blocker
> Attachments: MAPREDUCE-6689.1.patch
>
>
> We have seen this issue on one of our clusters: while running a terasort
> MapReduce job, some mappers failed after the reducers had started, and the
> MR AM then tried to preempt reducers to schedule the failed mappers.
> After that, the MR AM entered an infinite loop. On every
> RMContainerAllocator#heartbeat run, it:
> - In {{preemptReducesIfNeeded}}, it cancels all scheduled reducer requests.
> (total scheduled reducers = 1024)
> - Then, in {{scheduleReduces}}, it ramps up all reducers (total = 1024).
> As a result, total #requested-containers increases by 1024 on every
> MRAM-RM heartbeat (1 sec per heartbeat). The AM hung for 18+ hours, so
> we got 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
> This bug also triggered YARN-4844, which makes the RM stop scheduling
> anything.
> Thanks to [~sidharta-s] for helping with analysis.
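
The arithmetic in the description above can be checked with a small sketch. This is a hypothetical model, not Hadoop code: the class and method names are made up for illustration, and it assumes (as the report states) that each heartbeat's cancel-then-ramp-up cycle adds the full 1024 reducer requests to the RM-side count.

```java
// Hypothetical model of the heartbeat loop described in the report.
// None of these names exist in Hadoop; they are for illustration only.
public class ReducerRequestLoopSketch {

    /** Reducer requests re-issued on every heartbeat (value from the report). */
    static final int SCHEDULED_REDUCERS = 1024;

    /**
     * Counts the container requests that pile up on the RM side when every
     * heartbeat first cancels the scheduled reducers locally and then ramps
     * all of them up again. Assumes the cancellation is never netted out
     * against the new requests, which is the behaviour the report observed.
     */
    static long requestsSeenByRm(long heartbeats, int reducersPerHeartbeat) {
        long rmSideRequests = 0;
        int scheduled = reducersPerHeartbeat;
        for (long i = 0; i < heartbeats; i++) {
            // Step 1 (preemptReducesIfNeeded): cancel all scheduled reducers.
            scheduled = 0;
            // Step 2 (scheduleReduces): ramp every reducer up again.
            scheduled = reducersPerHeartbeat;
            // The RM-side total grows by the full amount each time.
            rmSideRequests += scheduled;
        }
        return rmSideRequests;
    }

    public static void main(String[] args) {
        long heartbeats = 18L * 3600L; // 18 hours at one heartbeat per second
        System.out.println(requestsSeenByRm(heartbeats, SCHEDULED_REDUCERS));
    }
}
```

With 18 hours of one-second heartbeats this gives 66,355,200 requests, which matches the "66M+" figure quoted above.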