Thanks a lot for the response, Arun. Just curious how the OutputCollector flushes key/value pairs to disk: is the periodic flush based on time (e.g. every couple of minutes) or on volume (e.g. every 100 key/value pairs of output)? The size of the map output varies for each key/value input: it could be as small as one key/value pair or as large as tens of millions of key/value pairs. I could change the way my application works to avoid this problem, but I am wondering if Hadoop already supports scalability in such cases, beyond just increasing memory?
Thanks,

Eric Zhang
Vespa content @Yahoo!
Work: 408-349-2466

-----Original Message-----
From: Arun C Murthy [mailto:[EMAIL PROTECTED]
Sent: Monday, August 20, 2007 12:58 PM
To: [email protected]
Subject: Re: how to deal with large amount of key value pair outputs in one run of map task

Eric,

On Mon, Aug 20, 2007 at 12:31:23PM -0700, Eric Zhang wrote:
>Hi,
>I have a hadoop application where each run of the map could potentially
>generate a large amount of key/value pairs, which caused an out-of-memory
>error. I am wondering if there is a way to inform hadoop to
>write the key/value pairs to disk periodically?
>

The standard OutputCollector already sorts and flushes key/value pairs to disk periodically... but clearly you could see memory-related issues during the sort etc. What is the observed size of your map outputs?

Try increasing the child-JVM memory limit via mapred.child.java.opts (the default is 200M).

Arun

>thanks,
>
>Eric Zhang
>Vespa content @Yahoo!
>Work: 408-349-2466
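[Editor's note] As a sketch of Arun's suggestion, the child-JVM heap can be raised via a hadoop-site.xml override (or per-job in the JobConf). The io.sort.mb entry below is an additional, assumed-relevant knob from Hadoop of that era: it bounds the in-memory map-output buffer whose filling triggers the sort-and-spill to disk, which relates to Eric's question about when flushes happen. The specific values are illustrative, not recommendations.

```xml
<!-- hadoop-site.xml: per-cluster overrides; jobs can also set these in JobConf -->
<configuration>
  <!-- Raise the child-JVM heap from the 200 MB default, per Arun's advice. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Assumed additional knob: size (in MB) of the in-memory sort buffer.
       When it fills, map output is sorted and spilled to disk, so spills are
       volume-based (buffer size), not time-based. -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
</configuration>
```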
