Does this do what you want?

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
-- as of the current version of Pig, you need to use two different loads
-- for a self join

J = join L1 by user, L2 by user; -- self join on user, so each row pairs two pages seen by the same user
F1 = foreach J generate L1::page as p1, L2::page as p2;
G = group F1 by (p1, p2);
F2 = foreach G generate group.p1 as p1, group.p2 as p2, COUNT(F1) as visitcount;
-- now you have the number of times users who visited p1 have also
-- visited p2

O = order F2 by p1, visitcount desc; -- most-visited pages first within each p1
dump O; -- your results


I haven't checked the syntax of the above query.

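One optimization you can do to reduce the output size of the join is to
first group by (user, page) and generate the count for each group. Then do
the self join on that result, carry the counts through into F1 as cnt, and
replace COUNT(F1) in F2 (above) with SUM(F1.cnt). To get the same numbers as
the COUNT version, cnt should be the product of the two per-user page counts.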
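
To sanity-check that the pre-aggregated version gives the same numbers, here
is a small plain-Python model of both pipelines run on the sample rows from
the quoted message. It is only a sketch of the arithmetic, not Pig: the
function names are made up for illustration, and it assumes the self join is
on user and that cnt in F1 is the product of the two per-(user, page) counts.

```python
from collections import Counter
from itertools import product

# Sample rows from the quoted DUMP: (user, page)
LOG = [
    ("User1", "a"), ("User1", "b"), ("User2", "f"), ("User3", "b"),
    ("User2", "a"), ("User1", "e"), ("User2", "b"), ("User2", "c"),
    ("User3", "d"), ("User1", "d"), ("User2", "e"), ("User2", "a"),
    ("User3", "c"), ("User1", "d"), ("User2", "c"), ("User3", "a"),
    ("User1", "d"), ("User2", "b"), ("User2", "e"), ("User3", "c"),
]


def pair_counts_naive(log):
    """Self-join the raw rows on user, then count joined rows per (p1, p2)."""
    counts = Counter()
    for (u1, p1), (u2, p2) in product(log, repeat=2):
        if u1 == u2:
            counts[(p1, p2)] += 1
    return counts


def pair_counts_preaggregated(log):
    """Count per (user, page) first, self-join those counts on user,
    then sum cnt = c1 * c2 per (p1, p2)."""
    per_user_page = Counter(log)  # (user, page) -> number of visits
    counts = Counter()
    for ((u1, p1), c1), ((u2, p2), c2) in product(per_user_page.items(), repeat=2):
        if u1 == u2:
            counts[(p1, p2)] += c1 * c2
    return counts


def also_visited(log, page):
    """Other pages ranked by visitcount (descending), ties broken by page name."""
    counts = pair_counts_preaggregated(log)
    rows = [(p2, n) for (p1, p2), n in counts.items() if p1 == page and p2 != page]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    assert pair_counts_naive(LOG) == pair_counts_preaggregated(LOG)
    for other_page, visitcount in also_visited(LOG, "a"):
        print(other_page, visitcount)
```

The pre-aggregated join only sees one row per distinct (user, page) instead of
one row per visit, which is where the saving comes from. Note that
also_visited drops the p1 = p2 pairs; the Pig query above keeps them unless
you add a filter.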

-Thejas


On 6/29/10 11:37 PM, "diagnos...@email.com" <diagnos...@email.com> wrote:

> Hi 
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
> 
> 
> A DUMP of my sample dataset is as follows,
> 
> 
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
> 
> 
> 
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
> 
> 
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
> 
> 
> Thanks
> Leslie
> 
> 
> 
> 
> 
