I was actually quite curious how Hadoop manages to get all of the records into
the Iterable in the first place. I thought it was using a very specialized
object that implements Iterable, but a heap dump shows it's likely just a
LinkedList. All I was doing was duplicating that object. Supposing I do as you
suggest, am I in danger of their list consuming all the memory if a user
decides to log 2x or 3x as much as they did this time?

~Matt

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Friday, June 29, 2012 6:52 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Hey Matt,

As far as I can tell, Hadoop isn't truly at fault here.

If your issue is that you collect everything in a list before you store it, you
should focus on that and avoid collecting it entirely. Why not serialize as you
receive, if the incoming order is already taken care of? As far as I can tell,
your AggregateRecords probably does nothing but serialize the stored
LinkedList. So instead of using a LinkedList, or even a composed Writable such
as AggregateRecords, just write the records out as you receive them via each
.next(). Would this not work for you? You could batch a constant number of
records to gain some write performance, but at least you won't use up your memory.
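A minimal sketch of what that could look like with the new mapreduce API; the
Text key/value types and the StreamingReducer name are placeholders for
whatever your job actually uses, not your real code:

  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Writes each record straight back out as it is iterated, instead of
  // copying everything into a LinkedList first. The framework reuses the
  // value object between iterations, so memory stays roughly constant no
  // matter how many records share a key.
  public class StreamingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);  // serialize immediately; retain nothing
      }
    }
  }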

You can serialize as you receive by following this:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
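If you need a side file rather than the normal job output, a rough sketch of
the pattern that FAQ entry describes is below; the /tmp/aggregates path layout
and the Text types are assumptions for illustration only:

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SideFileReducer extends Reducer<Text, Text, Text, Text> {
    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
      fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Name the side file after the task attempt so speculative or retried
      // attempts don't clobber each other (hypothetical path layout).
      Path out = new Path("/tmp/aggregates/"
          + context.getTaskAttemptID() + "-" + key);
      FSDataOutputStream stream = fs.create(out);
      try {
        for (Text value : values) {
          // Stream each record to HDFS as it arrives; nothing is buffered
          // beyond what the output stream itself holds.
          stream.write(value.getBytes(), 0, value.getLength());
          stream.write('\n');
        }
      } finally {
        stream.close();
      }
    }
  }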


--
Harsh J
