Guojun is right: the reduce() inputs are buffered and read off disk as you iterate, so you are in no danger there.
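For example, a minimal sketch of the "write them out as you receive them" idea (the Text key/value types and the class name are placeholders, not your actual record classes):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: emit each value as it is pulled from the iterator instead of
// copying everything into a LinkedList first. The framework streams spilled
// reduce inputs from disk, so only the current record needs to be in memory.
public class StreamingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The framework reuses the value object between iterations, so copy it
      // if you ever need to keep it; here it is written out immediately.
      context.write(key, value);
    }
  }
}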
On Fri, Jun 29, 2012 at 11:02 PM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:

> If you are referring to the iterable in the reducer, its values are special
> and are not all held in memory. Once the iterator passes a value, it is gone
> and you cannot recover it. There is no LinkedList behind it.
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac
>
>
> "Berry, Matt" <mwbe...@amazon.com>
> 06/29/2012 01:06 PM
> Please respond to mapreduce-user@hadoop.apache.org
>
> To: "mapreduce-user@hadoop.apache.org" <mapreduce-user@hadoop.apache.org>
> Cc:
> Subject: RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing
>
> I was actually quite curious as to how Hadoop was managing to get all of the
> records into the Iterable in the first place. I thought they were using a
> very specialized object that implements Iterable, but a heap dump shows
> they're likely just using a LinkedList. All I was doing was duplicating that
> object. Supposing I do as you suggest, am I in danger of having their list
> consume all the memory if a user decides to log 2x or 3x as much as they did
> this time?
>
> ~Matt
>
> -----Original Message-----
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Friday, June 29, 2012 6:52 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing
>
> Hey Matt,
>
> As far as I can tell, Hadoop isn't really at fault here.
>
> If your issue is that you collect the values in a list before you store them,
> focus on that and avoid collecting them at all. Why don't you serialize as
> you receive, if the incoming order is already taken care of? As far as I can
> tell, your AggregateRecords probably does nothing but serialize the stored
> LinkedList. So instead of using a LinkedList, or even a composed Writable
> such as AggregateRecords, just write the records out as you receive them via
> each .next(). Would this not work for you? You may batch a constant number of
> them to gain some write performance, but at least you won't have to use up
> your memory.
>
> You can serialize as you receive by following this:
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
>
> --
> Harsh J

--
Harsh J
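A minimal sketch of the "write to HDFS directly from the task" approach that the FAQ link above points to (the output path, record format, and Text types are placeholders, not anything from this thread):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: open a side file in HDFS, append each record as it is received,
// and close the stream when the task finishes. Nothing is buffered in a list.
public class DirectWriteReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
  private FSDataOutputStream out;

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // Name the side file after the task attempt so retried or speculative
    // attempts do not clobber each other's output.
    Path path = new Path("/tmp/aggregates/" + context.getTaskAttemptID());
    out = fs.create(path);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Serialize as received; a small batching buffer could improve write
      // performance, but the full group is never held in memory.
      out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    out.close();
  }
}

Cleaning up side files left behind by failed or speculative attempts is left out of the sketch.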