[
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148409#comment-16148409
]
Wilfred Spiegelenburg commented on MAPREDUCE-6447:
--------------------------------------------------
[~prateekrungta] and [~shuzhangyao]: I have seen this same issue a number of
times and people keep referring to this open MR issue.
I have dug into this and found that there is nothing wrong with the
calculation, and there is no need to change the way we handle this in the
code. No method devised for the internal calculation can guarantee that you
do not get an OOM error. In all cases I have run into, I have been able to
fix it with a configuration change in MR and the JVM.
Let me explain what I have found and why the issue will not be solved by
changing the internal calculations.
When the JVM threw an OOM for a reducer I collected heap dumps and looked at
what was allocated at the point in time the OOM was thrown. In most cases
the OOM was not thrown because the total heap was exhausted. As an example: the
JVM heap for the reducer was set to 9216MB (9GB), yet the heap dump showed only
5896MB of heap usage. Looking at the usage of the heap it showed that the
shuffle input buffer usage was well within its limits.
We then tried to lower the {{mapreduce.reduce.shuffle.input.buffer.percent}}
from the default 0.9 to 0.6 and found that it did not solve the issue. There
was still an OOM around the same point with approximately the same usage of the
heap. Lowering it further to 0.4 allowed the job to finish but we saw that the
JVM never peaked above about 60% of the assigned heap. This causes a lot of
waste on the cluster and is thus not a solution we could accept.
Further checks of the GC logging showed that in each of the OOM cases all heap
usage was in the old generation. That explains both the OOM the reducer was
throwing and what the heap dump showed: it had run out of space in the old
generation, not the total heap. Within the heap the old generation can take
about 2/3 of the total heap. This is based on the default settings for the
generation sizing in the heap [1].
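As a hedged illustration of that 2/3 figure: under the HotSpot default of {{NewRatio=2}} the old generation is NewRatio/(NewRatio+1) of the heap. The numbers below assume the 9216MB reducer heap from the example above; the class and method names are mine, not Hadoop code.

```java
public class GenerationSizing {
    // Old generation size under HotSpot generation sizing:
    // old / young = NewRatio, so old = NewRatio / (NewRatio + 1) of the heap.
    static long oldGenBytes(long heapBytes, int newRatio) {
        return heapBytes * newRatio / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heap = 9216L * 1024 * 1024; // 9216MB reducer heap from the example
        long oldGen = oldGenBytes(heap, 2); // default NewRatio = 2
        System.out.println("old generation MB = " + oldGen / (1024 * 1024));
        // With NewRatio = 2 the old generation is 2/3 of the heap: 6144MB here.
    }
}
```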
The question then became: what caused the JVM to run out of old generation
without using its young generation? This could be explained by the logging from
the reducer we had. The reducer logs showed that it was trying to allocate a
large shuffle response, in my case about 1.9GB. Even though this is a large
shuffle response it was within all the limits. The JVM will often allocate
large objects directly in the old generation instead of in the young
generation. This behaviour can cause an OOM error in the reducer while not
using the full heap: it simply runs out of old generation.
Back to the calculations. In the buffer we load all the shuffle data but we set
a maximum of 25% of the total buffer in one shuffle response. This is the in
memory merge limit. If the shuffle response is larger than 25% of the buffer
size we do not store it in the buffer but directly merge to disk. A shuffle
response is only accepted and downloaded if we can fit it in memory or if it
goes straight to disk. The check and increase of the buffer usage happens
before we start the download. Locking makes sure only one thread does this at a
time, the number of paralel copies is thus not important. Limiting could lead
to a deadlock as explained in the comments in the code. Since we need to
prevent deadlocks we allow one shuffle (one thread) to go over that limit. If
we would not allow this we could deadlock the reducer. The reducer would be in
a state that it can not download new data to reduce. There would also never be
a trigger that would cause the data in memory to be merged/spilled and thus the
buffer stays as full as it is.
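The reservation logic described above can be sketched roughly as follows. This is a simplified sketch, not the actual {{MergeManagerImpl}} code; the class, field, and method names here are illustrative only.

```java
public class ShuffleReserveSketch {
    final long memoryLimit;           // buffer% * heap: total in-memory shuffle budget
    final long maxSingleShuffleLimit; // limit% * buffer: the in-memory merge limit
    long usedMemory = 0;              // guarded by synchronization, as in the real code

    ShuffleReserveSketch(long heapBytes, double bufferPct, double limitPct) {
        this.memoryLimit = (long) (heapBytes * bufferPct);
        this.maxSingleShuffleLimit = (long) (memoryLimit * limitPct);
    }

    // Decide what to do with a shuffle response of the given size:
    // "DISK", "MEMORY", or "WAIT".
    synchronized String reserve(long requestedSize) {
        if (requestedSize > maxSingleShuffleLimit) {
            return "DISK";  // larger than 25% of the buffer: merge straight to disk
        }
        if (usedMemory > memoryLimit) {
            return "WAIT";  // buffer full: the fetcher stalls until a merge frees space
        }
        // Accepted. Note this single reservation may push usedMemory over
        // memoryLimit; that one overshoot is allowed to avoid deadlock.
        usedMemory += requestedSize;
        return "MEMORY";
    }
}
```

Because the check happens under the lock before the download starts, at most one in-flight shuffle can push the buffer over its limit, which is exactly why the overshoot is bounded.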
Based on all that the maximum size of all the data we could store in the
shuffle buffer would be:
{code}
mapreduce.reduce.shuffle.input.buffer.percent = buffer% = 70%
mapreduce.reduce.shuffle.memory.limit.percent = limit% = 25%
heap size = 9GB
maximum used memory = ((buffer% * (1 + limit%)) * heap size) - 1 byte
{code}
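A worked example of that formula for the 9GB heap from above, using the 70%/25% values in the formula block (a sketch; the 2/3 old-generation figure assumes the default {{NewRatio=2}}):

```java
public class ShuffleBufferMax {
    public static void main(String[] args) {
        double bufferPct = 0.70; // mapreduce.reduce.shuffle.input.buffer.percent
        double limitPct = 0.25;  // mapreduce.reduce.shuffle.memory.limit.percent
        double heapMB = 9216;    // 9GB reducer heap

        // The buffer can be just under full, plus one in-flight shuffle of up
        // to limit% of the buffer that is allowed to go over the limit.
        double maxUsedMB = bufferPct * (1 + limitPct) * heapMB; // minus 1 byte
        double oldGenMB = heapMB * 2 / 3; // default NewRatio = 2

        System.out.println("maximum shuffle buffer usage MB = " + maxUsedMB);
        System.out.println("old generation MB = " + oldGenMB);
        // Roughly 8064MB of shuffle data cannot fit in a 6144MB old
        // generation, so an OOM can be thrown while total heap usage is
        // still well below 9GB.
    }
}
```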
If that buffer does not fit in the old generation we can throw an OOM error
without really running out of memory. This is especially true when the
individual shuffle sizes are large but do not hit the in-memory limit.
Everything is still properly calculated and limited. We also do not unknowingly
use more than the configured buffer size: if we go over, we know exactly by how
much.
We worked around the problem without increasing the size of the heap. We did
this by changing the generation sizing: the old generation inside the heap was
enlarged by increasing the "NewRatio" setting from 2 (the default) to 4. We
also lowered the "input.buffer.percent" setting to 0.65. That worked in our
case with 9GB as the maximum heap for the reducer. Different heap sizes
combined with a different "memory.limit.percent" might require a different
"NewRatio" setting.
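A hedged sketch of what that combination could look like in configuration. The property names are the standard MR2 ones; the exact {{-Xmx}} value assumed here matches the 9GB example above and would need adjusting for other heap sizes.

```xml
<!-- mapred-site.xml: sketch of the workaround for the 9GB reducer example -->
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.65</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <!-- NewRatio=4 grows the old generation from 2/3 to 4/5 of the heap -->
  <value>-Xmx9216m -XX:NewRatio=4</value>
</property>
```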
[1]
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html
> reduce shuffle throws "java.lang.OutOfMemoryError: Java heap space"
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-6447
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
> Reporter: shuzhangyao
> Assignee: shuzhangyao
> Priority: Minor
>
> 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild:
> Exception running child :
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in
> shuffle in fetcher#10
> at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at
> org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> at
> org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> at
> org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
> at
> org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
> at
> org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
> at
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
> at
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)