Hi, I have a mapred job that has about 60 million input records and groups them into 1- or 2-element units (that is, a reducer always gets one or two records with the same key).
I have 2 GB of RAM set up for each map/reduce task, and some of the reduce tasks fail with OutOfMemoryError. I've got a heap dump of one of the reduce tasks taken when it was close to OOM. It turned out that most of the memory is consumed by the org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread class, which holds org.apache.hadoop.mapred.Merger$Segment objects that are still reachable (there are about 170 of them, each taking about 8 MB of retained size).

Unfortunately, I'm not an expert in the Hadoop code, so I can't tell whether this is normal behavior or not. However, common sense tells me that the memory consumption is a bit too high. Do you have any ideas/thoughts about the described issue?

Any pointers are highly appreciated.

Vyacheslav
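For reference, these are the shuffle/merge settings that, as far as I understand, control the reduce-side in-memory merge done by InMemFSMergeThread. I have not changed them, so the values below are just the documented defaults for these properties, not something from my actual config:

```xml
<!-- mapred-site.xml (illustrative excerpt; values are the documented defaults) -->
<property>
  <!-- Fraction of the reducer's heap used to buffer map outputs during the shuffle -->
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <!-- Usage threshold of the shuffle buffer at which the in-memory merge is triggered -->
  <name>mapred.job.shuffle.merge.percent</name>
  <value>0.66</value>
</property>
<property>
  <!-- Number of in-memory segments that also triggers an in-memory merge -->
  <name>mapred.inmem.merge.threshold</name>
  <value>1000</value>
</property>
```

If the segments I'm seeing should have been merged and spilled once one of these thresholds was crossed, then ~170 live segments of ~8 MB each might indicate the merge isn't keeping up, but I may be misreading how these knobs interact.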
