Hi, I'm using the streaming API and I notice my reducer gets - in the same
invocation - a bunch of different keys, and I wonder why.
I would expect to get one key per reducer run, as with a "normal"
Hadoop (Java API) reducer.

Is this to limit the amount of spawned processes, assuming creating and
destroying processes is usually expensive compared to the amount of
work they'll need to do (not much, if you have many keys with each a
handful of values)?
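
For reference, this is how I'm handling it now: since the streaming
reducer task receives all the (sorted) "key<TAB>value" lines of its
partition on stdin, the script has to detect key boundaries itself. A
minimal sketch in Python, assuming a hypothetical word-count-style
aggregation:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming reducer: the task gets ALL keys of its
# partition on stdin as sorted "key<TAB>value" lines, so the script
# must detect key changes itself. (Hypothetical sum-per-key example.)
import sys

def reduce_stream(lines):
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total  # previous key is finished
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total          # flush the last key

if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (key, total))
```

It works, but it's exactly this key-boundary bookkeeping (and any
per-key state) that I'd like to avoid.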

OTOH, if you have a large number of values over a small number of keys,
I would rather stick to one key per reducer invocation, so that I don't
need to worry about supporting (and allocating memory for) multiple
input keys.  Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference in
comparison to the default way of working, and should maybe be added to
the FAQ at
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter
