On Wed, 1 Jul 2009 11:11:26 -0700 Ted Dunning <[email protected]> wrote:
Thanks for your reply, it got me thinking in new interesting ways :-)

> For getting the unique counts, you would also emit two additional
> lines in the time unit loop:
>
>    output.collect([GEO_UNIQUE, timeUnit, t, logLineGeoLocationCode], logLineUserId)
>    output.collect([URL_UNIQUE, timeUnit, t, logLineUrl], logLineUserId)
>
> Now your combiner needs to switch behavior slightly. For a key that
> starts with GEO or URL, it should behave as before. For a key that
> starts with GEO_UNIQUE or URL_UNIQUE, it should accumulate a list of
> unique ids.

OK. If I understand correctly, this would give me the following as input
to the reducer:

   [URL_UNIQUE, timeUnit, t, logLineUrl1],
       <logLineUserId1, logLineUserId2, ..., logLineUserId1, logLineUserIdN, ...>

i.e. a list of all logLineUserIds that have accessed logLineUrl1. And, as
stated in my example, if user logLineUserId1 hit logLineUrl1 twice, there
will be two entries in the list. Correct?

Now, if I want to count the unique users hitting a specific logLineUrl,
I'm thinking that I would have to run a second MR job on the output. So,
the reducer of the first MR job would output, to a timeUnit-specific
output file:

   logLineUserId1,logLineUrl1
   logLineUserId2,logLineUrl1
   ...
   logLineUserId1,logLineUrl1
   logLineUserIdN,logLineUrl1
   logLineUserIdX,logLineUrl2
   ...etc...

The second MR job would take this data as input, use the identity mapper,
and then reduce so that there's only one entry per (logLineUserId,
logLineUrl) pair. Finally, a third job could count the number of unique
users per logLineUrl. Does this sound like the way to do it (rough
sketches in the postscripts below), or am I overcomplicating things?

> As you mention, Pig does this automagically. Indeed, doing much of
> this in Java leads to *really* nasty programs that are hard to
> maintain. Pig does this on the fly, however, and doesn't require
> that you look at the result. Indeed, this kind of transform is
> exactly what makes Pig (and Jaql and Cascading) higher order as well
> as higher level.

Uhum... well, I'm just wrapping my head around the MR paradigm and
Hadoop. Having to learn Pig and/or Cascading on top of that is a
challenge. It might be worth it, though.

Thanks,
\EF
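
P.S. To make sure I've understood the switching combiner, here is a rough
sketch of what I think you mean. It uses the old org.apache.hadoop.mapred
API (to match the output.collect() pseudocode above) and assumes the
composite key [TYPE, timeUnit, t, geoOrUrl] is packed into a single
tab-separated Text, and that the plain GEO/URL keys carry numeric partial
counts as Text values; the class and field names are mine, not from your
mail.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SwitchingCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String type = key.toString().split("\t", 2)[0];
    if (type.equals("GEO_UNIQUE") || type.equals("URL_UNIQUE")) {
      // Accumulate the distinct user ids this map task has seen and
      // forward each one exactly once; the reducer unions these sets.
      Set<String> uniqueIds = new HashSet<String>();
      while (values.hasNext()) {
        uniqueIds.add(values.next().toString());
      }
      for (String userId : uniqueIds) {
        output.collect(key, new Text(userId));
      }
    } else {
      // GEO / URL keys: behave as before, i.e. sum the partial counts.
      long sum = 0;
      while (values.hasNext()) {
        sum += Long.parseLong(values.next().toString());
      }
      output.collect(key, new Text(Long.toString(sum)));
    }
  }
}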
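
P.P.S. And here is roughly what I have in mind for the dedup step of the
second job (identity mapper, then this reducer). It assumes the first
job wrote each "logLineUserId,logLineUrl" pair as the key of a line that
the second job reads back via KeyValueTextInputFormat; again, the names
are mine.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DedupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // All duplicates of one (logLineUserId, logLineUrl) pair arrive
    // under the same key, so ignoring the values and emitting the key
    // once leaves exactly one entry per pair.
    output.collect(key, new Text(""));
  }
}

The third job would then be a plain count, word-count style: map each
deduped line to (logLineUrl, 1) and sum in the reducer.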
