From the stack trace you provided, your OOM is probably due to HADOOP-3931, which is fixed in 0.17.2. It occurs when the deserialized key in an emitted record exactly fills the serialization buffer that collects map outputs, causing an allocation as large as that buffer. The bug causes an extra spill, triggers an OOM exception if the task JVM's max heap is too small to mask it, and skips the combiner if you've defined one, but it won't drop records.
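
If upgrading isn't possible right away, giving the task JVMs a larger max heap can mask the bug in the meantime. A minimal sketch of that stopgap in job setup (the class name and the -Xmx value are only illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper; only the property name is significant.
public class HeapWorkaround {
  public static JobConf apply(JobConf conf) {
    // Give each task JVM a heap large enough to absorb the extra
    // buffer-sized allocation until the upgrade to 0.17.2.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    return conf;
  }
}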

However, I was wondering: are these hard architectural limits? Say I wanted to emit 25,000 maps for a single input record; would that mean I would need huge amounts of (virtual) memory? In other words, what exactly is the reason that increasing the number of emitted maps per input record causes an OutOfMemoryError?
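
To make the scenario concrete, a rough sketch of the kind of mapper meant here, using the old mapred API (the class name and the 25,000 fan-out are purely illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: emits many output records for each input record.
public class FanOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final int FAN_OUT = 25000; // outputs per input record

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    for (int i = 0; i < FAN_OUT; i++) {
      // Each collect() lands in the fixed-size in-memory buffer and is
      // spilled to disk when that buffer fills up.
      output.collect(new Text(value.toString() + "#" + i), new LongWritable(i));
    }
  }
}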


Do you mean the number of output records per input record in the map? The memory allocated for collecting records out of the map is (mostly) fixed at the size defined in io.sort.mb. The ratio of input records to output records does not affect the collection and sort. The number of output records can sometimes influence the memory requirements, but not significantly. -C
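
For reference, a rough sketch of where that buffer is configured (the values shown are just the usual defaults of that era; the class name is illustrative):

import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper; only the property names are significant.
public class SortBufferConf {
  public static void apply(JobConf conf) {
    // Fixed-size buffer (in MB) used to collect and sort map output;
    // it does not grow with the number of records emitted per input.
    conf.setInt("io.sort.mb", 100);
    // Fraction of that buffer reserved for per-record accounting, which
    // is where the record count has a (small) effect on memory.
    conf.set("io.sort.record.percent", "0.05");
  }
}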
