Thanks, Owen. By configuring mapred.child.java.opts to a larger value (it took a little while to figure out the right way to set it: -Xmx300m), the out-of-memory problem went away. It's good to know that the default value of io.sort.mb is 100 MB and that my map task required about a 300 MB heap to run.
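For reference, a minimal sketch of how the same property could be set programmatically with the JobConf API of that era, rather than in the site configuration file; the class name and the exact heap value are illustrative assumptions, not something from this thread:

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical job setup showing the mapred.child.java.opts property
    // discussed above being set in code instead of hadoop-site.xml.
    public class JobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(JobSetup.class);
        // Give each child task JVM a 300 MB heap instead of the default,
        // so the map-side buffer and the task's own data both fit.
        conf.set("mapred.child.java.opts", "-Xmx300m");
        return conf;
      }
    }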
Eric Zhang
Vespa content @Yahoo!
Work: 408-349-2466

-----Original Message-----
From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 21, 2007 10:32 AM
To: [email protected]
Subject: Re: how to deal with large amount of key value pair outputs in one run of map task

On Aug 20, 2007, at 2:05 PM, Eric Zhang wrote:

> Thanks a lot for the response, Arun. Just curious how OutputCollector
> flushes key/value pairs to disk: is the periodic flush based on time
> (like every couple of minutes) or based on volume (like every 100
> key/value pair outputs)?
> The size of the map output varies for each key/value input; it could be
> as small as one key/value pair or as big as tens of millions of
> key/value pairs. I could try to change the way my application works to
> avoid this problem, but I am wondering if Hadoop already supports
> scalability in such cases besides increasing memory?

It uses io.sort.mb, which is the number of megabytes to keep before you sort and spill to disk. (The config variable was named back when the sort was being handled very differently, and thus the unobvious name.) A major point of map/reduce is to scale to very large data sets and make very few assumptions about what will fit in memory at once.

-- Owen
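To make the io.sort.mb point concrete, here is a small sketch using the same era's JobConf API; the property names are the ones discussed in this thread, while the class name and the specific sizes are assumptions for illustration only:

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative sketch (class name is hypothetical): io.sort.mb controls
    // how many megabytes of map output are buffered in memory before a
    // sort-and-spill to disk, so it has to fit inside the child task heap.
    public class SortBufferSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(SortBufferSetup.class);
        // Buffer this many megabytes of map output before spilling;
        // 100 was the default at the time.
        conf.setInt("io.sort.mb", 100);
        // Keep the child heap comfortably larger than the sort buffer.
        conf.set("mapred.child.java.opts", "-Xmx300m");
        return conf;
      }
    }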
