Thanks a lot for the response, Arun. Just curious how the OutputCollector flushes key/value pairs to disk: is the periodic flush based on time (e.g. every couple of minutes) or on volume (e.g. every 100 key/value pairs of output)? The size of the map output varies for each key/value input: it could be as small as one key/value pair or as large as tens of millions of key/value pairs. I could change the way my application works to avoid this problem, but I am wondering if Hadoop already supports scalability in such cases, beyond just increasing memory?
Thanks,

Eric Zhang
Vespa content @Yahoo!
Work: 408-349-2466

-----Original Message-----
From: Arun C Murthy [mailto:[EMAIL PROTECTED]
Sent: Monday, August 20, 2007 12:58 PM
To: [email protected]
Subject: Re: how to deal with large amount of key value pair outputs in one run of map task

Eric,

On Mon, Aug 20, 2007 at 12:31:23PM -0700, Eric Zhang wrote:
>Hi,
>I have a hadoop application where each run of the map could potentially
>generate a large amount of key/value pairs, which caused an out-of-memory
>error. I am wondering if there is a way to inform hadoop to
>write the key/value pairs to disk periodically?
>

The standard OutputCollector already sorts and flushes key/value pairs to disk periodically... but clearly you could see memory-related issues during the sort etc. What is the observed size of your map outputs?

Try increasing the child-JVM memory limit via mapred.child.java.opts (the default is 200M).

Arun

>thanks,
>
>Eric Zhang
>Vespa content @Yahoo!
>Work: 408-349-2466
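[Editor's note] As a sketch of Arun's suggestion, the child-JVM heap can be raised via a hadoop-site.xml override (or per-job in the JobConf). The io.sort.mb entry below is an additional, assumed-relevant knob from Hadoop of that era: it bounds the in-memory map-output buffer whose filling triggers the sort-and-spill to disk, which relates to Eric's question about when flushes happen. The specific values are illustrative, not recommendations.

```xml
<!-- hadoop-site.xml: per-cluster overrides; jobs can also set these in JobConf -->
<configuration>
  <!-- Raise the child-JVM heap from the 200 MB default, per Arun's advice. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Assumed additional knob: size (in MB) of the in-memory sort buffer.
       When it fills, map output is sorted and spilled to disk, so spills are
       volume-based (buffer size), not time-based. -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
</configuration>
```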
