On Aug 20, 2007, at 2:05 PM, Eric Zhang wrote:

Thanks a lot for the response, Arun. Just curious how OutputCollector
flushes key/value pairs to disk: is the periodic flush based on time
(like every couple of minutes) or on volume (like every 100 key/value
pairs output)?
The size of the map output varies for each key/value input; it could be
as small as one key/value pair or as large as tens of millions of
key/value pairs. I could change the way my application works to avoid
this problem, but I am wondering whether Hadoop already supports
scalability in such cases, other than by increasing memory?

It uses io.sort.mb, which is the number of megabytes of map output to buffer before sorting and spilling to disk. (The config variable was named back when the sort was handled very differently, hence the unobvious name.) A major point of map/reduce is to scale to very large data sets while making very few assumptions about what will fit in memory at once.
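
For example, a job could raise that threshold through its job configuration (a minimal sketch using the old mapred JobConf API; the class name SpillConfigExample and the 200 MB value are illustrative, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    public class SpillConfigExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SpillConfigExample.class);
            // Buffer up to 200 MB of map output in memory before each
            // sort-and-spill to disk; a larger buffer means fewer spills
            // at the cost of more task heap.
            conf.setInt("io.sort.mb", 200);
            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
        }
    }

Either way, spilling is driven by the volume of buffered output, not by a timer, so a single input record producing tens of millions of output pairs will simply trigger multiple spills rather than exhaust memory.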

-- Owen
