Hi Erik,

You can output multiple name/value pairs from a mapper for each input line, then reduce on each grouping.
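
A minimal sketch of that idea, assuming a simplified, hypothetical input format of "ip url user_id" separated by whitespace (not a real Apache log line) and made-up key prefixes ("ip:", "url:", "user:") to keep the groupings apart. The sort-and-group step only simulates locally what the framework's shuffle would do between map and reduce:

```python
from itertools import groupby


def map_line(line):
    """Emit several tagged key/value pairs for one input line.

    Assumes a hypothetical 'ip url user_id' layout, whitespace separated.
    """
    fields = line.split()
    if len(fields) < 3:
        return  # skip malformed lines
    ip, url, user_id = fields[0], fields[1], fields[2]
    # The prefix on each key says which statistic the pair belongs to.
    yield ("ip:" + ip, 1)
    yield ("url:" + url, 1)
    yield ("user:" + user_id, 1)


def reduce_sorted(pairs):
    """Sum the values for each key; expects pairs sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def run_local(lines):
    """Local stand-in for map, shuffle (sort) and reduce."""
    mapped = [kv for line in lines for kv in map_line(line)]
    return dict(reduce_sorted(sorted(mapped)))
```

Calling run_local() on a few lines returns one count per tagged key, so the hits per IP, the hits per URL and the set of distinct users (the number of "user:" keys) all come out of a single pass over the input.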
But if you want nested or multiple MapReduce jobs, take a look at Cascading (www.cascading.org). I have written several applications where I take my inputs, use a map function to clean them up and calculate some new values, then pass the result to dozens of reducer functions, each of which reduces on something different. The cool thing is that I don't have to worry about how to tell Hadoop about the dependencies or which jobs can be run in parallel. Cascading takes care of this.

Chris

On Wed, Jul 1, 2009 at 3:23 AM, Erik Forsberg <[email protected]> wrote:
> Hi!
>
> I'm fairly new to Hadoop and the world of MapReduce. I think I've
> managed to understand the basics, but one thing I'm having trouble
> understanding is how to efficiently get multiple datapoints out of a
> single input line.
>
> I'm thinking of cases like analysis of Apache log lines where I may
> want to produce:
>
> *) Geolocation-based stats for requests based on connecting IP.
> *) Top URLs info.
> *) Count of unique users based on mod_usertrack info (unique
>    identifier for each user).
>
> ..and possibly some combinations, like "Top URLs by geolocation
> country".
>
> Most simple MapReduce examples read one input line at a time, and emit
> one key/value pair. I can see how that works great if you want to
> create for example only the Top URLs, but I'm having trouble
> understanding how to efficiently do what I want to do.
>
> Running against the same set of input data multiple times feels like a
> naive but very inefficient way to solve the problem. There must be
> better ways?
>
> Pig seems to be able to do this somehow, correct? How does it work
> behind the scenes? (or should I ask on the Pig list?)
>
> I think I read somewhere that you could have multiple named output
> channels from mappers, which could then be read by the
> combiners/reducers, but now I can't find it. Any ideas what I'm talking
> about?
>
> Would writing to Task Side-Effect files then running new MR jobs on the
> output be a viable option?
> (http://hadoop.apache.org/core/docs/r0.18.3/mapred_tutorial.html#Task+Side-Effect+Files)
> That only works if your Mappers and Reducers are written in Java, not
> with Streaming/scripting languages, correct?
>
> Any input, pointers to FAQs I've missed, etc. would be much
> appreciated.
>
> Thanks,
> \EF
> --
> Erik Forsberg <[email protected]>
