Hi!
I'm fairly new to Hadoop and the world of MapReduce. I think I've
managed to understand the basics, but one thing I'm having trouble
understanding is how to efficiently get multiple datapoints out of a
single input line.
I'm thinking of cases like analysis of Apache log lines where I may
want to produce:
*) Geolocation-based stats for requests based on connecting IP.
*) Top URLs info.
*) Count of unique users based on mod_usertrack info (unique
identifier for each user).
...and possibly some combinations, like "Top URLs by geolocation
country".
Most simple MapReduce examples read one input line at a time and emit
a single key/value pair. I can see how that works great if you only
want to compute, say, the Top URLs, but I'm having trouble
understanding how to efficiently compute all of the above at once.
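To make it concrete, the best idea I've come up with so far is a
single mapper that tags each emitted key with the metric it belongs
to, so one pass over the logs feeds several aggregations at once. A
rough Streaming-style sketch (the field positions and the geolocate()
stub are made up, not real code):

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper emitting several tagged
# key/value pairs per Apache log line. "|" separates the metric tag
# from the rest of the key, so Streaming's tab-based key/value
# splitting is left alone. All names here are placeholders.
import sys

def geolocate(ip):
    # Placeholder: a real mapper would do a GeoIP lookup here.
    return "XX"

def map_line(line):
    """Yield one tagged (key, count) pair per metric for a log line."""
    fields = line.split()
    if len(fields) < 7:
        return  # skip malformed lines
    ip, url = fields[0], fields[6]  # assumed combined-log positions
    yield ("geo|" + geolocate(ip), 1)
    yield ("url|" + url, 1)
    yield ("geourl|" + geolocate(ip) + "|" + url, 1)

if __name__ == "__main__":
    for line in sys.stdin:
        for key, count in map_line(line):
            print("%s\t%d" % (key, count))
```

The reducers would then split the tag back off each key to decide
which statistic a record belongs to. Is that roughly the right way to
think about it?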
Running multiple jobs against the same set of input data feels like a
naive and very inefficient way to solve the problem. There must be
better ways?
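On the reduce side, I imagine a single reducer could handle all the
metrics if every key carries a metric tag, e.g. keys shaped like
"<metric>|<value>". A minimal Streaming-style sketch of what I mean
(the key format is my own invention):

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming reducer summing counts per tagged key
# of the form "<metric>|<value>". Relies on the framework delivering
# input sorted by key, as Streaming does. Names are assumptions.
import sys
from itertools import groupby

def reduce_pairs(pairs):
    """Sum counts per key; pairs must arrive sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for key, total in reduce_pairs((k, int(v)) for k, v in pairs):
        print("%s\t%d" % (key, total))
```

Downstream jobs (or a post-processing step) could then split the
output by tag into per-metric result sets.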
Pig seems to be able to do this somehow, correct? How does it work
behind the scenes? (or should I ask on the Pig list?)
I think I read somewhere that you could have multiple named output
channels from mappers, which could then be read by the
combiners/reducers, but now I can't find it. Any ideas what I'm talking
about?
Would writing to Task Side-Effect files then running new MR jobs on the
output be a viable option?
(http://hadoop.apache.org/core/docs/r0.18.3/mapred_tutorial.html#Task+Side-Effect+Files)
That only works if your Mappers and Reducers are written in Java, not
with Streaming/scripting languages, correct?
Any input, pointers to FAQ's I've missed, etc. would be much
appreciated.
Thanks,
\EF
--
Erik Forsberg <[email protected]>