Hi Thejas,

Thanks, with your input I managed to work it out. Here's my solution. Hope it's useful to someone.

/* load in the log file */
log = LOAD 'example-user-tracking.txt' AS (user:chararray, page:chararray);

/* generate the list of unique users grouped by page */
unique_log = DISTINCT log;
g_users = GROUP unique_log BY page;
l_page_users = FOREACH g_users GENERATE group AS page, FLATTEN(unique_log.user) AS user;

/* generate a list of pages grouped by users, and add on a counter */
l_counted = FOREACH log GENERATE user, page, 1 AS counter;
g_userpages = GROUP l_counted BY (user,page);
l_userpages = FOREACH g_userpages GENERATE FLATTEN(group), COUNT(l_counted) AS occurs;

/* join the 2 lists together using user as key */
joined = JOIN l_page_users BY user, l_userpages BY group::user;
l_page_page = FOREACH joined GENERATE l_page_users::page AS page1, l_userpages::group::page AS page2, l_userpages::occurs AS occurs;
g_page_page = GROUP l_page_page BY (page1, page2);

/* generate a list showing the number of occurrences of starting page moving to landing page */
result = FOREACH g_page_page GENERATE group, SUM(l_page_page.occurs) AS occurs;

/* display the result */
DUMP result;

If someone is able to optimize this, please do share. I'm not sure if my version is the best way to achieve the result.

Thanks
L.

-----Original Message-----
From: Thejas Nair <[email protected]>
To: [email protected] <[email protected]>; [email protected]
Sent: Thu, Jul 1, 2010 6:10 am
Subject: Re: Help with writing Pig Query

Does this do what you want?

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
-- as of the current version of Pig, you need to use two different loads for a self join
J = join L1 by user, L2 by user; -- self join on user, pairing up the pages each user visited
F1 = foreach J generate L1::page as p1, L2::page as p2;
G = group F1 by (p1, p2);
F2 = foreach G generate group.p1 as p1, group.p2 as p2, COUNT(F1) as visitcount;
-- now you have the number of times users who visited p1 have visited p2
O = order F2 by p1, visitcount;
dump O; -- your results

I haven't checked the syntax of the above query.

One optimization you can do to reduce the output size of the join is to do a group-by on (user, page), then generate the count. Then do the self-join on that result, and replace COUNT(F1) in F2 (above) with SUM(F1.cnt).

-Thejas

On 6/29/10 11:37 PM, "[email protected]" <[email protected]> wrote:

> Hi
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
>
> A DUMP of my sample dataset is as follows,
>
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
>
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
>
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
>
> Thanks
> Leslie
