On Aug 20, 2007, at 2:05 PM, Eric Zhang wrote:
Thanks a lot for the response, Arun. Just curious how OutputCollector
flushes key/value pairs to disk: is the periodic flush based on time
(like every couple of minutes) or on volume (like every 100 key/value
pairs of output)?
The size of the map output varies for each key/value input; it could
be as small as one key/value pair or as large as tens of millions of
key/value pairs. I could change the way my application works to avoid
this problem, but I am wondering whether Hadoop already supports
scalability in such cases, beyond just increasing memory?
It uses io.sort.mb, which is the number of megabytes of map output to
buffer before sorting and spilling to disk. (The config variable was
named back when the sort was handled very differently, hence the
unobvious name.) A major point of map/reduce is to scale to very
large data sets while making very few assumptions about what will fit
in memory at once.
-- Owen
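
[Editor's note: a minimal sketch, not from the thread, of how one
might raise that buffer with the old mapred API current in 2007. The
class name and the value 200 are hypothetical; io.sort.mb defaulted
to 100 in that era.]

    import org.apache.hadoop.mapred.JobConf;

    public class SortBufferExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SortBufferExample.class);
            conf.setJobName("sort-buffer-example");
            // io.sort.mb: megabytes of map output buffered in memory
            // before the framework sorts it and spills a run to local
            // disk. Raise it only if map tasks have heap to spare.
            conf.setInt("io.sort.mb", 200);
            // ... set mapper/reducer and input/output paths, then
            // submit with JobClient.runJob(conf).
        }
    }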