Hi, I'm using the streaming API and I notice my reducer gets - in the same
invocation - a bunch of different keys, and I wonder why.
I would expect to get one key per reducer run, as with a "normal"
Hadoop (Java API) reducer.

Is this to limit the amount of spawned processes, assuming creating and
destroying processes is usually expensive compared to the amount of
work they'll need to do (not much, if you have many keys with each a
handful of values)?
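
For reference, this is how I'm handling it now: since the streaming
reducer task receives all the (sorted) "key<TAB>value" lines of its
partition on stdin, the script has to detect key boundaries itself. A
minimal sketch in Python, assuming a hypothetical word-count-style
aggregation:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming reducer: the task gets ALL keys of its
# partition on stdin as sorted "key<TAB>value" lines, so the script
# must detect key changes itself. (Hypothetical sum-per-key example.)
import sys

def reduce_stream(lines):
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total  # previous key is finished
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total          # flush the last key

if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (key, total))
```

It works, but it's exactly this key-boundary bookkeeping (and any
per-key state) that I'd like to avoid.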

OTOH, if you have a large number of values over a small number of keys,
I would rather stick to one key per reducer invocation, so that I don't
need to worry about supporting (and allocating memory for) multiple
input keys.  Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference in
comparison to the default way of working, and should maybe be added to
the FAQ at
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter
