Hi, I have a question regarding the shuffle phase of the reducer.

It appears that when there is a large map output (in my case, 5 billion records), I get an out-of-memory error like the one below:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory. So why would a user run into this OutOfMemoryError? After I increased my number of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,
Mingxi
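P.S. In case it helps, here is roughly how I changed the reducer count (a minimal sketch using the old mapred API, not my actual job code; the ShuffleTuning class name and the driver boilerplate are made up, and the buffer-percent line is just a knob I am wondering about):

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ShuffleTuning.class);

            // More reducers means each reducer fetches a smaller slice of the
            // 5-billion-record map output, so fewer bytes sit in its shuffle
            // buffer at once.
            conf.setNumReduceTasks(200); // was 100

            // Assumed alternative: shrink the fraction of reducer heap used to
            // buffer fetched map outputs (default 0.70 in Hadoop 1.x), which
            // should make segments spill to disk earlier instead of OOM-ing.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
        }
    }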