On Wed, 1 Jul 2009 11:11:26 -0700 Ted Dunning <[email protected]> wrote:
Thanks for your reply, it got me thinking in new interesting ways :-)

> For getting the unique counts, you would also emit two additional
> lines in the time unit loop:
>
>    output.collect([GEO_UNIQUE, timeUnit, t, logLineGeoLocationCode], logLineUserId)
>    output.collect([URL_UNIQUE, timeUnit, t, logLineUrl], logLineUserId)
>
> Now your combiner needs to switch behavior slightly. For a key that
> starts with GEO or URL, it should behave as before. For a key that
> starts with GEO_UNIQUE or URL_UNIQUE, it should accumulate a list of
> unique ids.

OK. If I understand correctly, this would give me the following as input
to the reducer:

   [URL_UNIQUE, timeUnit, t, logLineUrl1],
       <logLineUserId1, logLineUserId2, ..., logLineUserId1, logLineUserIdN, ...>

i.e. a list of all logLineUserIds that have accessed logLineUrl1. And, as
stated in my example, if user logLineUserId1 hit logLineUrl1 twice, there
will be two entries in the list. Correct?

Now, if I want to count the unique users hitting a specific logLineUrl,
I'm thinking that I would have to run a second MR job on the output. So,
the reducer of the first MR job would output, to a timeUnit-specific
output file:

   logLineUserId1,logLineUrl1
   logLineUserId2,logLineUrl1
   ...
   logLineUserId1,logLineUrl1
   logLineUserIdN,logLineUrl1
   logLineUserIdX,logLineUrl2
   ...etc...

The second MR job would take this data as input, use the identity mapper,
and then reduce so that there's only one entry per (logLineUserId,
logLineUrl) pair. Finally, a third job could count the number of unique
users per logLineUrl. Does this sound like the way to do it (rough
sketches in the postscripts below), or am I overcomplicating things?

> As you mention, Pig does this automagically. Indeed, doing much of
> this in Java leads to *really* nasty programs that are hard to
> maintain. Pig does this on the fly, however, and doesn't require
> that you look at the result. Indeed, this kind of transform is
> exactly what makes Pig (and Jaql and Cascading) higher order as well
> as higher level.

Uhum... well, I'm just wrapping my head around the MR paradigm and
Hadoop. Having to learn Pig and/or Cascading on top of that is a
challenge. It might be worth it, though.

Thanks,
\EF
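
P.S. To make sure I've understood the switching combiner, here is a rough
sketch of what I think you mean. It uses the old org.apache.hadoop.mapred
API (to match the output.collect() pseudocode above) and assumes the
composite key [TYPE, timeUnit, t, geoOrUrl] is packed into a single
tab-separated Text, and that the plain GEO/URL keys carry numeric partial
counts as Text values; the class and field names are mine, not from your
mail.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SwitchingCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String type = key.toString().split("\t", 2)[0];
    if (type.equals("GEO_UNIQUE") || type.equals("URL_UNIQUE")) {
      // Accumulate the distinct user ids this map task has seen and
      // forward each one exactly once; the reducer unions these sets.
      Set<String> uniqueIds = new HashSet<String>();
      while (values.hasNext()) {
        uniqueIds.add(values.next().toString());
      }
      for (String userId : uniqueIds) {
        output.collect(key, new Text(userId));
      }
    } else {
      // GEO / URL keys: behave as before, i.e. sum the partial counts.
      long sum = 0;
      while (values.hasNext()) {
        sum += Long.parseLong(values.next().toString());
      }
      output.collect(key, new Text(Long.toString(sum)));
    }
  }
}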
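
P.P.S. And here is roughly what I have in mind for the dedup step of the
second job (identity mapper, then this reducer). It assumes the first
job wrote each "logLineUserId,logLineUrl" pair as the key of a line that
the second job reads back via KeyValueTextInputFormat; again, the names
are mine.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DedupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // All duplicates of one (logLineUserId, logLineUrl) pair arrive
    // under the same key, so ignoring the values and emitting the key
    // once leaves exactly one entry per pair.
    output.collect(key, new Text(""));
  }
}

The third job would then be a plain count, word-count style: map each
deduped line to (logLineUrl, 1) and sum in the reducer.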
