Hello,

On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
<[email protected]> wrote:
> Hi, I'm using the streaming API and I notice my reducer gets - in the same
> invocation - a bunch of different keys, and I wonder why.
> I would expect to get one key per reducer run, as with the "normal"
> hadoop.
>
> Is this to limit the amount of spawned processes, assuming creating and
> destroying processes is usually expensive compared to the amount of
> work they'll need to do (not much, if you have many keys with each a
> handful of values)?
>
> OTOH if you have a high number of values over a small number of keys, I
> would rather stick to one-key-per-reducer-invocation, then I don't need
> to worry about supporting (and allocating memory for) multiple input
> keys.  Is there a config setting to enable such behavior?
>
> Maybe I'm missing something, but this seems like a big difference in
> comparison to the default way of working, and should maybe be added to
> the FAQ at
> http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
>
> thanks,
> Dieter
>

I think it makes more sense to think of streaming programs as
complete map/reduce 'tasks' rather than as the map() and reduce()
functions of the functional model. Both programs have to handle
their own input reading: the map script reads one record per line,
while the reduce script receives sorted key/value lines that may
span several distinct keys. The reduce script therefore has to run
its own reading loop and detect key boundaries itself.
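To make the reading loop concrete, here is a minimal sketch of a streaming reducer in Python (stdlib only, assuming the usual tab-separated "key\tvalue" lines and that values are integer counts to be summed; the helper names are mine, not part of any API). Since Hadoop sorts the reducer's input by key, consecutive lines with the same key can be grouped with itertools.groupby:

```python
import sys
from itertools import groupby

def parse(lines):
    # Split each "key\tvalue" line into a (key, value) pair.
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

def reduce_lines(lines):
    # Group consecutive lines by key (input must be sorted by key,
    # which Hadoop guarantees) and sum the integer values per key.
    results = []
    for key, pairs in groupby(parse(lines), key=lambda kv: kv[0]):
        results.append((key, sum(int(v) for _, v in pairs)))
    return results

if __name__ == "__main__":
    # One reducer invocation may see many keys, so loop over them all.
    for key, total in reduce_lines(sys.stdin):
        print(f"{key}\t{total}")
```

The groupby call is where the "one key per reduce() call" illusion is rebuilt by hand: each (key, pairs) iteration plays the role that a single reduce() invocation plays in the Java API.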

Some non-Java libraries that provide abstractions atop the
streaming/etc. layer allow for more fluent representations of the
map() and reduce() functions, hiding away the other fine details (like
the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce
programs, for example.
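For comparison, a word count in Dumbo looks roughly like the following sketch (adapted from the short tutorial linked below; details may differ across Dumbo versions). The library does the stdin parsing and key grouping, so you write only the per-key logic:

```python
# Word-count mapper/reducer in the style Dumbo expects:
# plain generator functions over (key, value) pairs.

def mapper(key, value):
    # 'value' is one input line; emit (word, 1) for each word.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # 'values' is an iterator over all counts for one key.
    yield key, sum(values)

# In an actual Dumbo program the module would end with:
#   if __name__ == "__main__":
#       import dumbo
#       dumbo.run(mapper, reducer)
# and be submitted with something like:
#   dumbo start wordcount.py -hadoop /path/to/hadoop \
#       -input in.txt -output out
```

Note how the reducer here really does get one key per call, with Dumbo handling the reading loop shown earlier for you.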

A FAQ entry on this would be good too! You can file a ticket to add
this observation to the streaming docs' FAQ.

[1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial

-- 
Harsh J
http://harshj.com
