From the stack trace you provided, your OOM is probably due to
HADOOP-3931, which is fixed in 0.17.2. It occurs when the serialized
key of an emitted record exactly fills the serialization buffer that
collects map outputs, triggering an allocation as large as that
buffer. The bug causes an extra spill and, if the task JVM's max heap
is too small to mask it, an OOM exception; it will also skip the
combiner if you've defined one, but it won't drop records.
However, I was wondering: are these hard architectural limits? Say I
wanted to emit 25,000 maps for a single input record; would that mean
I would require huge amounts of (virtual) memory? In other words,
what exactly is the reason that increasing the number of emitted maps
per input record causes an OutOfMemoryError?
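To make the scenario concrete, I mean something like the following
mapper (a rough sketch against the old org.apache.hadoop.mapred API;
the class name is made up), where each input record fans out into
25,000 output records:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper emitting 25,000 output records per input record.
public class FanOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    for (int i = 0; i < 25000; i++) {
      // Each collect() serializes the record into the map's fixed-size
      // output buffer (io.sort.mb), spilling to disk when it fills.
      output.collect(value, new IntWritable(i));
    }
  }
}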
Do you mean the number of output records per input record in the map?
The memory allocated for collecting records out of the map is (mostly)
fixed at the size defined in io.sort.mb. The ratio of input records to
output records does not affect the collection and sort. The number of
output records can sometimes influence the memory requirements, but
not significantly. -C
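P.S. For reference, a rough sketch of the knobs that size the
collection buffer (assuming the 0.17/0.18-era property names; the
class name is a placeholder). The per-map cost is set here, not by
the input-to-output record ratio:

import org.apache.hadoop.mapred.JobConf;

public class SortBufferConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SortBufferConfig.class);
    // Total memory used to collect serialized map output, in MB.
    // This is the (mostly) fixed per-task cost, regardless of how
    // many output records each input record produces.
    conf.setInt("io.sort.mb", 100);
    // Fraction of that buffer reserved for per-record metadata;
    // many small records need a larger fraction, not a larger heap.
    conf.set("io.sort.record.percent", "0.05");
    // Begin spilling to disk when the buffer reaches this fraction.
    conf.set("io.sort.spill.percent", "0.80");
  }
}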