Hi, I have a question regarding the shuffle phase of the reducer.

It appears that when there is a large map output (in my case, 5 billion records), I get an out-of-memory error like the one below:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory. So why would a user run into this OutOfMemoryError? After I increased my number of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,
Mingxi
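P.S. In case it helps, here is roughly how I changed the reducer count (a minimal sketch using the old mapred API, not my actual job code; the ShuffleTuning class name and the driver boilerplate are made up, and the buffer-percent line is just a knob I am wondering about):

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ShuffleTuning.class);

            // More reducers means each reducer fetches a smaller slice of the
            // 5-billion-record map output, so fewer bytes sit in its shuffle
            // buffer at once.
            conf.setNumReduceTasks(200); // was 100

            // Assumed alternative: shrink the fraction of reducer heap used to
            // buffer fetched map outputs (default 0.70 in Hadoop 1.x), which
            // should make segments spill to disk earlier instead of OOM-ing.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
        }
    }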