Hi Thejas
Thanks, with your input I managed to work it out. Here's my solution. Hope it's 
useful to someone.



/* load in the log file */
log = LOAD 'example-user-tracking.txt' AS (user:chararray, page:chararray);


/* generate the list of unique users grouped by page */
unique_log = DISTINCT log;
g_users = GROUP unique_log BY page;
l_page_users = FOREACH g_users GENERATE group AS page, FLATTEN(unique_log.user) 
AS user;


/* generate a list of pages grouped by users, and add on a counter */
l_counted = FOREACH log GENERATE user, page, 1 AS counter;
g_userpages = GROUP l_counted BY (user,page);
l_userpages = FOREACH g_userpages GENERATE FLATTEN (group), COUNT(l_counted) AS 
occurs;


/* joined the 2 lists together using user as key */
joined = JOIN l_page_users BY user, l_userpages BY group::user;
l_page_page = FOREACH joined GENERATE l_page_users::page AS page1, 
l_userpages::group::page AS page2, l_userp
ages::occurs AS occurs;
g_page_page = GROUP l_page_page BY (page1, page2);
/* generate a list showing the number of occurence of starting page moving to 
landing page */
result = FOREACH g_page_page GENERATE group, SUM(l_page_page.occurs) AS occurs;


/* display the result */
DUMP result;


If someone is able to optimize this, please do share. Not sure if my version is 
the best way to achieve the result.
Thanks
L.





-----Original Message-----
From: Thejas Nair <[email protected]>
To: [email protected] <[email protected]>; 
[email protected]
Sent: Thu, Jul 1, 2010 6:10 am
Subject: Re: Help with writing Pig Query


Does this do what you want ? -

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
 -- as of current version of pig, you need to use two different loads for
self join 

J = join L1 by page, L2 by page; -- self join on page
F1 = foreach j generate L1::page as p1, L2::page as p2;
G = group F1 by p1,p2;
F2 = foreach G generate group.p1 as p1, group.p2 as p2 , COUNT(F1) as
visitcount; -- now you have the number of times user who visited p1 has
visited p2

O = order F2 by p1, visitcount;
dump O; -- you results


I haven't checked the syntax of above query.

One optimization you can do to reduce the output size of join, is to do a
group-by on user,page , then generate the count. Then do self-join on that
result, replace COUNT(F1) in F2(above) with SUM(F1.cnt)

-Thejas


On 6/29/10 11:37 PM, "[email protected]" <[email protected]> wrote:

> Hi 
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
> 
> 
> A DUMP of my sample dataset is as follows,
> 
> 
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
> 
> 
> 
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
> 
> 
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
> 
> 
> Thanks
> Leslie
> 
> 
> 
> 
> 


 

Reply via email to