Hi, I'm using the streaming API and I notice that my reducer receives a bunch of different keys in the same invocation, and I wonder why. I would expect one key per reducer run, as with "normal" Hadoop.
Is this done to limit the number of spawned processes, on the assumption that creating and destroying a process is usually expensive compared to the amount of work it does (not much, if you have many keys with only a handful of values each)? OTOH, if you have a large number of values over a small number of keys, I would rather stick to one key per reducer invocation, so that I don't need to worry about supporting (and allocating memory for) multiple input keys. Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference from the default way of working, and should perhaps be added to the FAQ at http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter
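For context, a streaming reducer gets the sorted lines for its whole partition on stdin, one "key\tvalue" line per value, so a single process can see many keys and has to detect key boundaries itself. Here is a minimal sketch of the usual pattern (the word-count-style summing is just an illustrative assumption, as is the tab separator, which is the streaming default):

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming reducer that handles multiple keys per
# invocation: input lines arrive sorted by key, so a key boundary is
# simply the point where the key field changes.
import sys

def reduce_stream(lines):
    """Sum integer values per key; yield (key, total) at each key boundary."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total  # previous key is complete
            current_key, total = key, 0
    # values for one key are contiguous, so we can accumulate safely
        total += int(value)
    if current_key is not None:
        yield current_key, total  # flush the last key

if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print("%s\t%d" % (key, total))
```

Because the sort guarantees that all values for a key are contiguous, only one key's state needs to be held in memory at a time, even though the process sees many keys.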
