On Tue, 29 Mar 2011 23:17:13 +0530 Harsh J <[email protected]> wrote:
> Hello, > > On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck > <[email protected]> wrote: > > Hi, I'm using the streaming API and I notice my reducer gets - in > > the same invocation - a bunch of different keys, and I wonder why. > > I would expect to get one key per reducer run, as with the "normal" > > hadoop. > > > > Is this to limit the amount of spawned processes, assuming creating > > and destroying processes is usually expensive compared to the > > amount of work they'll need to do (not much, if you have many keys > > with each a handful of values)? > > > > OTOH if you have a high number of values over a small number of > > keys, I would rather stick to one-key-per-reducer-invocation, then > > I don't need to worry about supporting (and allocating memory for) > > multiple input keys. Is there a config setting to enable such > > behavior? > > > > Maybe I'm missing something, but this seems like a big difference in > > comparison to the default way of working, and should maybe be added > > to the FAQ at > > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions > > > > thanks, > > Dieter > > > > I think it would make more sense to think of streaming programs as > complete map/reduce 'tasks', instead of trying to apply the Map/Reduce > functional concept. Both of the programs need to be written from the > reading level onwards, which in map's case each line is record input > and in reduce's case it is one uniquely grouped key and all values > associated to it. One would need to handle the reading-loop > themselves. > > Some non-Java libraries that provide abstractions atop the > streaming/etc. layer allow for more fluent representations of the > map() and reduce() functions, hiding away the other fine details (like > the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce > programs, for example. > > A FAQ entry on this should do good too! You can file a ticket for an > addition of this observation to the streaming docs' FAQ. > > [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial > Thanks, this makes it a little clearer. I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410 Dieter
