Does this do what you want?

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
-- as of the current version of Pig, you need two separate loads for a self join
J = join L1 by user, L2 by user; -- self join on user, so each row pairs two pages visited by the same user
F1 = foreach J generate L1::page as p1, L2::page as p2;
G = group F1 by (p1, p2);
F2 = foreach G generate group.p1 as p1, group.p2 as p2, COUNT(F1) as visitcount;
-- now you have the number of times users who visited p1 have visited p2
O = order F2 by p1, visitcount;
dump O; -- your results

I haven't checked the syntax of the above query.

One optimization you can do to reduce the output size of the join is to do a group-by on (user, page) first and generate the count. Then do the self join on that result, and replace COUNT(F1) in F2 (above) with SUM(F1.cnt).

-Thejas

On 6/29/10 11:37 PM, "diagnos...@email.com" <diagnos...@email.com> wrote:

> Hi
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
>
> A DUMP of my sample dataset is as follows,
>
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
>
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
>
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
>
> Thanks
> Leslie
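
The pre-aggregation suggested in the reply above might look roughly like the sketch below. It has not been run. The aliases G1, G2, C1, C2 and the field name cnt are illustrative names, not something from the thread, and the join is assumed to be on user so that each joined row pairs two pages visited by the same user. It keeps the two separate loads mentioned in the reply, and it sorts visitcount in descending order as one reading of "ranked by number of times the page was visited".

```pig
-- Sketch only: untested, aliases and the cnt field are illustrative.
L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);

-- Collapse repeated (user, page) rows into one row carrying a visit count,
-- so the join sees one row per distinct (user, page) instead of one per visit.
G1 = GROUP L1 BY (user, page);
C1 = FOREACH G1 GENERATE group.user AS user, group.page AS page, COUNT(L1) AS cnt;
G2 = GROUP L2 BY (user, page);
C2 = FOREACH G2 GENERATE group.user AS user, group.page AS page, COUNT(L2) AS cnt;

-- Self join on user (assumption): pairs every page a user visited (p1)
-- with every page the same user visited (p2), keeping the count for p2.
J  = JOIN C1 BY user, C2 BY user;
F1 = FOREACH J GENERATE C1::page AS p1, C2::page AS p2, C2::cnt AS cnt;

-- Sum the per-user counts for each (p1, p2) pair.
G  = GROUP F1 BY (p1, p2);
F2 = FOREACH G GENERATE group.p1 AS p1, group.p2 AS p2, SUM(F1.cnt) AS visitcount;

-- Highest visit counts first within each p1.
O  = ORDER F2 BY p1, visitcount DESC;
DUMP O;
```

Note that summing the p2 count this way counts each visitor of p1 once, regardless of how often that user hit p1. That is not the same number a COUNT over the raw self join produces, since the raw join repeats each pair once per p1 visit. Rows where p1 equals p2 are still present and could be filtered out if only "other" pages are wanted.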