[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148409#comment-16148409
 ] 

Wilfred Spiegelenburg commented on MAPREDUCE-6447:
--------------------------------------------------

[~prateekrungta] and [~shuzhangyao]: I have seen this same issue a number of 
times, and people keep referring to this open MR issue.
I have dug into it and found that there is nothing wrong with the calculation 
and no need to change the way we handle this in the code. No method devised 
for the internal calculation can guarantee that you never get an OOM error. 
In all the cases I have run into, I have been able to fix the problem with a 
configuration change in MR and the JVM.

Let me explain what I have found and why the issue will not be solved by 
changing the internal calculations.

When the JVM threw an OOM for a reducer, I collected a heap dump and looked at 
what was allocated at the point in time the OOM was thrown. In most cases the 
OOM was not thrown because the total heap was exhausted. As an example: the 
JVM heap for the reducer was set to 9216MB (9GB), yet the heap dump showed 
only 5896MB of heap in use. Looking at the contents of the heap showed that 
the shuffle input buffer usage was well within its limits.
We then tried lowering {{mapreduce.reduce.shuffle.input.buffer.percent}} from 
0.9 to 0.6 and found that it did not solve the issue: there was still an OOM 
around the same point, with approximately the same heap usage. Lowering it 
further to 0.4 allowed the job to finish, but we saw that the JVM never peaked 
above about 60% of the assigned heap. That wastes a lot of cluster capacity 
and is thus not a solution we could accept.
Further checks of the GC logging showed that, in each of the OOM cases, all 
heap usage was in the old generation. That explains the OOM and the heap 
dump: the reducer had run out of space in the old generation, not the total 
heap. With the default settings for generation sizing in the heap [1], the 
old generation can take about 2/3 of the total heap.
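As a rough sketch of that sizing (assuming the default NewRatio of 2 and no explicit young-generation size such as -Xmn; the helper name is made up for illustration):

```python
# Old-generation capacity implied by the JVM's NewRatio setting [1]:
# old_gen = heap * NewRatio / (NewRatio + 1)
def old_gen_mb(heap_mb, new_ratio=2):
    return heap_mb * new_ratio / (new_ratio + 1)

print(old_gen_mb(9216))  # 6144.0 MB, roughly 2/3 of the 9216 MB reducer heap
```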

The question then became: what caused the JVM to run out of old generation 
without using its young generation? The logging from the reducer explained 
it. The reducer logs showed that it was trying to allocate a large shuffle 
response, in my case about 1.9GB. Even though this is a large shuffle 
response, it was within all the limits. The JVM will often allocate large 
objects directly in the old generation instead of in the young generation. 
This behaviour can cause an OOM error in the reducer while the full heap is 
not in use and only the old generation has run out.

Back to the calculations. We load all the shuffle data into the buffer, but 
we cap a single shuffle response at 25% of the total buffer: the in-memory 
merge limit. If a shuffle response is larger than 25% of the buffer size, we 
do not store it in the buffer but merge it directly to disk. A shuffle 
response is only accepted and downloaded if it fits in memory or goes 
straight to disk. The check and increase of the buffer usage happen before we 
start the download, and locking makes sure only one thread does this at a 
time, so the number of parallel copies does not matter. Strictly enforcing 
the limit could lead to a deadlock, as explained in the comments in the code. 
Since we need to prevent deadlocks, we allow one shuffle (one thread) to go 
over that limit. If we did not allow this, the reducer could deadlock: it 
would be in a state where it cannot download new data to reduce, and nothing 
would ever trigger the in-memory data to be merged/spilled, so the buffer 
would stay as full as it is.
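A hedged sketch of that accept/merge decision (not the actual MergeManagerImpl code; all names, including the one_over_in_flight flag, are illustrative):

```python
# Illustrative sketch of the reserve decision described above; the real
# logic lives in MergeManagerImpl.reserve() and uses locking, not a flag.
def reserve(size, used, buffer_limit, single_shuffle_limit, one_over_in_flight):
    if size > single_shuffle_limit:
        return "DISK"    # over the in-memory merge limit: merge straight to disk
    if used + size <= buffer_limit:
        return "MEMORY"  # fits within the shuffle input buffer
    if not one_over_in_flight:
        return "MEMORY"  # the single allowed over-limit shuffle (deadlock guard)
    return "WAIT"        # stall until an in-memory merge frees buffer space

print(reserve(300, 0, 1000, 250, False))    # DISK
print(reserve(200, 950, 1000, 250, False))  # MEMORY: the one allowed overshoot
print(reserve(200, 950, 1000, 250, True))   # WAIT
```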

Based on all that the maximum size of all the data we could store in the 
shuffle buffer would be:
{code}
mapreduce.reduce.shuffle.input.buffer.percent = buffer% = 70%
mapreduce.reduce.shuffle.memory.limit.percent = limit% = 25%
heap size = 9GB
maximum used memory = ((buffer% * (1 + limit%)) * heap size) - 1 byte
{code}
If that buffer does not fit in the old generation, we can throw an OOM error 
without really running out of memory. This is especially true when the 
individual shuffle sizes are large but do not hit the in-memory limit. 
Everything is still properly calculated and limited; we never unknowingly use 
more than the configured buffer size, and if we go over, we know exactly by 
how much.
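Plugging the numbers from this comment into that formula makes the mismatch concrete (a sketch; the 1-byte term is ignored, and the old-generation size assumes the default NewRatio of 2):

```python
heap_mb = 9216                 # 9 GB reducer heap from the example above
buffer_pct = 0.70              # mapreduce.reduce.shuffle.input.buffer.percent
limit_pct = 0.25               # mapreduce.reduce.shuffle.memory.limit.percent

max_buffer_mb = buffer_pct * (1 + limit_pct) * heap_mb
old_gen_mb = heap_mb * 2 / 3   # old generation with the default NewRatio of 2

# ~8064 MB worst-case buffer vs ~6144 MB old generation: the buffer can
# outgrow the old generation long before the 9216 MB heap is full.
print(max_buffer_mb, old_gen_mb)
```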

We worked around the problem without increasing the size of the heap by 
changing the generations. We grew the old generation inside the heap by 
increasing the "NewRatio" setting from 2 (the default) to 4, and we lowered 
the "input.buffer.percent" setting to 0.65. That worked in our case with the 
9GB maximum heap for the reducer. Different heap sizes combined with a 
different "memory.limit.percent" might require a different "NewRatio" setting.
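The same arithmetic applied to the workaround (a sketch, with the same assumption that old gen = NewRatio / (NewRatio + 1) of the heap):

```python
heap_mb = 9216
old_gen_mb = heap_mb * 4 / 5            # NewRatio=4: old gen grows to 4/5 of heap
steady_mb = 0.65 * heap_mb              # buffer without the one over-limit shuffle
worst_mb = 0.65 * (1 + 0.25) * heap_mb  # worst case with the single overshoot

# ~7372.8 MB old gen vs ~5990.4 MB steady-state buffer: plenty of headroom.
# Only the worst case (~7488 MB) is still borderline, which is why other heap
# sizes or a different memory.limit.percent may need a different NewRatio.
print(old_gen_mb, steady_mb, worst_mb)
```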

[1] 
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html

> reduce shuffle throws "java.lang.OutOfMemoryError: Java heap space"
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6447
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
>            Reporter: shuzhangyao
>            Assignee: shuzhangyao
>            Priority: Minor
>
> 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#10
>       at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>       at 
> org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
>       at 
> org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
