Thanks everyone for the help. Emitting each record individually from the 
reducer is working well, and I can still aggregate the needed information as I 
go.
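
Roughly what the reducer does now, for anyone who finds this thread later (a sketch only; the Text types and the simple count are stand-ins for the real record and summary classes):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: emit each record the moment it is seen and keep only a small
// running aggregate in memory, instead of collecting everything in a list.
public class EmitAsYouGoReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (Text record : records) {
      context.write(key, record);  // written out immediately, never buffered
      count++;                     // aggregate "as I go"
    }
    context.write(key, new Text("record-count=" + count));  // small summary only
  }
}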

From: Harsh J [mailto:ha...@cloudera.com]
Sent: Friday, June 29, 2012 9:40 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Guojun is right, the reduce() inputs are buffered and read off of disk. You are 
in no danger there.
On Fri, Jun 29, 2012 at 11:02 PM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:

If you are referring to the Iterable in the reducer, it is special and not held in 
memory at all.  Once the iterator passes a value, it is lost and you cannot 
recover it.  There is no LinkedList behind it.
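
In code terms, that contract looks roughly like this (a sketch; the Text value type here is just illustrative):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: the values Iterable can only be walked once; Hadoop streams the
// data from its sorted/merged spills and typically reuses the same Writable
// instance for each value it hands you.
public class SinglePassReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);  // use the value while you have it
      // To keep a value past this iteration you would have to copy it
      // (e.g. new Text(value)), which is exactly what fills the heap.
    }
    // A second loop over 'values' here would not replay the data.
  }
}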

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac

   "Berry, Matt" <mwbe...@amazon.com<mailto:mwbe...@amazon.com>>

   06/29/2012 01:06 PM
   Please respond to
mapreduce-user@hadoop.apache.org<mailto:mapreduce-user@hadoop.apache.org>


To

"mapreduce-user@hadoop.apache.org<mailto:mapreduce-user@hadoop.apache.org>" 
<mapreduce-user@hadoop.apache.org<mailto:mapreduce-user@hadoop.apache.org>>

cc

Subject

RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing







I was actually quite curious as to how Hadoop was managing to get all of the 
records into the Iterable in the first place. I thought they were using a very 
specialized object that implements Iterable, but a heap dump shows they're 
likely just using a LinkedList. All I was doing was duplicating that object. 
Supposing I do as you suggest, am I in danger of having their list consume all 
the memory if a user decides to log 2x or 3x as much as they did this time?

~Matt

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Friday, June 29, 2012 6:52 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Hey Matt,

As far as I can tell, Hadoop isn't truly at fault here.

If your issue is that you collect the records in a list before you store them, focus 
on that and avoid collecting them at all. Why not serialize as you receive, since the 
incoming order is already taken care of? As far as I can tell, your AggregateRecords 
probably does nothing but serialize the stored LinkedList. So instead of using a 
LinkedList, or even a composed Writable such as AggregateRecords, just write the 
records out as you receive them via each .next(). Would this not work for you? You 
can batch a constant number of records to gain some write performance, but at least 
you won't use up all your memory.

You can serialize as you receive by following this:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
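
A rough sketch of that pattern (the Text value type and the output path below are placeholders, not details from your job):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: stream each value straight into a side file on HDFS as it is
// received, so nothing accumulates in the reducer's heap.
public class StreamToHdfsReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // Placeholder path; a real job would derive this from its own output layout.
    Path sideFile = new Path("/tmp/records-" + key.toString());
    try (FSDataOutputStream out = fs.create(sideFile)) {
      for (Text value : values) {
        out.write(value.getBytes(), 0, value.getLength());  // serialize on arrival
        out.write('\n');
      }
    }
  }
}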


--
Harsh J

--
Harsh J
