I have a job that uses an identity mapper and the same code for both the combiner and the reducer. In a small percentage of combiner tasks, after a few seconds I get errors that look like this:
FATAL mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Those tasks fail, but they subsequently restart and complete successfully, and eventually the whole job completes. Nevertheless, this happens consistently enough that it is clearly a problem with my code rather than a transient glitch on my cluster.

From the stack it looks like the out-of-memory error happens before any of my combiner code has had a chance to run. If I don't specify a combiner class and run everything through the reducers, there are no out-of-memory errors and everything works fine.

Obviously I have a bug, but I'm wondering if anyone has seen this particular failure mode before and has insight into why it happens. My hypothesis is that some memory usage in my combiner/reducer code doesn't scale to the largest inputs my job receives. It is a problem for combiners but not reducers because more combiners than reducers run on a single task tracker node. That is, the problematic task is not the one failing during initialization but one running at the same time on the same node and chewing up all the memory. Does this hypothesis sound plausible?
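In case it matters: my understanding is that MapOutputBuffer's constructor (the frame at the top of the stack) allocates the map-side sort buffer up front, sized by io.sort.mb, and that allocation has to fit inside the child JVM heap set by mapred.child.java.opts. These are the settings I've been looking at; the values below are just the usual defaults for illustration, not my actual configuration:

```xml
<!-- mapred-site.xml: illustrative values only, not my actual configuration -->
<property>
  <!-- Size in MB of the in-memory buffer that MapOutputBuffer allocates
       up front; the failed allocation in the stack trace is this buffer. -->
  <name>io.sort.mb</name>
  <value>100</value>
</property>
<property>
  <!-- Heap for each child task JVM; io.sort.mb must fit within this. -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>
```

If my reading is right, the buffer allocation happening before any combiner code runs would be consistent with where the OOM appears in the stack.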
