Re: Squeezing multiple datapoints out of one input line?

Chris Curtin Wed, 01 Jul 2009 05:55:04 -0700

Hi Erik,

You can emit multiple key/value pairs from a mapper for each input line,
then reduce on each grouping.
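
As a minimal sketch of that idea: the code below is plain Python that simulates
the map, shuffle and reduce steps in memory. It is not Hadoop API code. The
simplified "<ip> <user_id> <url>" line format and the names parse_line, mapper,
reducer and run are assumptions made up for illustration. Each key is tagged
with the statistic it belongs to, so one pass over the input feeds all of them.

```python
from collections import defaultdict


def parse_line(line):
    """Split a simplified, hypothetical log line of the form '<ip> <user_id> <url>'."""
    ip, user_id, url = line.split()
    return ip, user_id, url


def mapper(line):
    """Emit several tagged key/value pairs for one input line."""
    ip, user_id, url = parse_line(line)
    yield ("url", url), 1        # feeds the "top URLs" statistic
    yield ("ip", ip), 1          # per-IP counts (a geolocation lookup would go here)
    yield ("user", user_id), 1   # one group per distinct user


def reducer(key, values):
    """Sum the counts for one tagged key."""
    return key, sum(values)


def run(lines):
    """Simulate the shuffle: group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = [
        "1.2.3.4 u1 /index.html",
        "1.2.3.4 u2 /index.html",
        "5.6.7.8 u1 /about.html",
    ]
    for key, count in sorted(run(sample).items()):
        print(key, count)
```

The number of distinct keys tagged "user" gives the unique-user count, and a
combined statistic such as "top URLs by country" would just be one more tagged
key emitted from the same mapper.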


But if you want nested or multiple MapReduce jobs, take a look at Cascading
(www.cascading.org). I have written several applications where I take my
inputs, use a map function to clean up or calculate some new values, then
pass the result to dozens of reducer functions, each reducing on something
different. The cool thing is that I don't have to worry about how to tell
Hadoop about the dependencies or which jobs can run in parallel. Cascading
takes care of this.

Chris

On Wed, Jul 1, 2009 at 3:23 AM, Erik Forsberg <[email protected]> wrote:

> Hi!
>
> I'm fairly new to Hadoop and the world of MapReduce. I think I've
> managed to understand the basics, but one thing I'm having trouble
> understanding is how to efficiently get multiple datapoints out of a
> single input line.
>
> I'm thinking of cases like analysis of Apache log lines where I may
> want to produce:
>
>  *) Geolocation-based stats for requests based on connecting IP.
>  *) Top URLs info.
>  *) Count of unique users based on mod_usertrack info (unique
>    identifier for each user).
>
> ..and possibly some combinations, like "Top URLs by geolocation
> country".
>
> Most simple MapReduce examples read one input line at a time, and emit
> one key/value pair. I can see how that works great if you want to
> create for example only the Top URLs, but I'm having trouble
> understanding how to efficiently do what I want to do.
>
> Running against the same set of input data multiple times feels like a
> naive but very inefficient way to solve the problem. There must be
> better ways?
>
> Pig seems to be able to do this somehow, correct? How does it work
> behind the scenes? (or should I ask on the Pig list?)
>
> I think I read somewhere that you could have multiple named output
> channels from mappers, which could then be read by the
> combiners/reducers, but now I can't find it. Any ideas what I'm talking
> about?
>
> Would writing to Task Side-Effect files then running new MR jobs on the
> output be a viable option?
> (
> http://hadoop.apache.org/core/docs/r0.18.3/mapred_tutorial.html#Task+Side-Effect+Files
> )
> That only works if your Mappers and Reducers are written in Java, not
> with Streaming/scripting languages, correct?
>
> Any input, pointers to FAQs I've missed, etc. would be much
> appreciated.
>
> Thanks,
> \EF
> --
> Erik Forsberg <[email protected]>
>
