Good afternoon,

I am using Pig on server logs to compute statistics on visited pages.

So far I am able to detect two kinds of matches:

- a user has visited a page that matches a given aim;
- a user has visited a page that matches one of several pages belonging to an aim.

*aim.id file:*

1...@1@add_to_kart   --> aim 1 can be add_to_kart
1...@1@browse        --> or aim 1 can be browse
1...@2@paid          --> aim 2 is only paid

*site.log file:*

user1,1,www.site.com/browse
user1,1,www.site.com/browse
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/paid
user2,1,www.site.com/browse
user3,1,www.site.com/browse

*Pig script:*

register 'piggybank.jar';

-- load aim id database:
aim_ids = LOAD 'aim.id' USING PigStorage('@')
    AS (aim_site_id:int, aim_id:int, aim_url:chararray);

DUMP aim_ids;

-- load site log:
site = LOAD 'site.log' USING PigStorage(',')
    AS (user_id:chararray, site_id:int, url:chararray);

site_all_aims = JOIN site BY site_id, aim_ids BY aim_site_id;

site_match = FOREACH site_all_aims GENERATE user_id, site_id, aim_id,
    org.apache.pig.piggybank.evaluation.string.INDEXOF(url, aim_url) AS match;

site_aims = FILTER site_match BY (match != -1) AND (match IS NOT null);

DUMP site_aims;

*results:*

(user1,1,1,13) --> user 1 achieved aim 1
(user1,1,1,13) --> user 1 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,2,13) --> user 2 achieved aim 2
(user2,1,1,13) --> user 2 achieved aim 1
(user3,1,1,13) --> user 3 achieved aim 1


Now I would like to check that a user has visited several pages to achieve
a single aim. For example, to achieve aim 3, a user must visit "browse" AND
"add_to_kart" AND "paid".

My idea was to store the pages of each aim as a bag of tuples:

1...@3@{(browse),(add_to_kart),(paid)}

and to write a UDF that compares the aim's URL tuples with the bag of URLs
the user has visited on the site.

But I am not able to load tuples with an undefined number of elements,
since aims might look like:

1...@3@{(browse),(add_to_kart),(paid)}
1...@4@{(browse_something_else),(add_to_kart_something_else),(paid_something_else),(another_page)}
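
One thing I have not tried yet is to declare the third field as a bag
rather than a tuple, since a bag may hold any number of tuples. The
following is only an untested sketch: it assumes that PigStorage parses
the {(a),(b)} text form when the schema declares a bag, and the relation
names aim_sets and aim_pages are made up for the example.

```pig
-- Untested sketch: the third field is declared as a bag of one-field tuples,
-- so each line of aim.id may carry any number of (url) entries.
aim_sets = LOAD 'aim.id' USING PigStorage('@')
    AS (aim_site_id:int, aim_id:int, aim_urls:bag{t:tuple(aim_url:chararray)});

-- Flatten the bag to get one row per required page of each aim.
aim_pages = FOREACH aim_sets GENERATE aim_site_id, aim_id,
    FLATTEN(aim_urls) AS aim_url;

DUMP aim_pages;
```

If this loads correctly, aim_pages would have the same shape as aim_ids in
the script above, with one row per (site, aim, page).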

So I am stuck on this problem for now, and I am still looking for another
way to write this script and the aim.id file.
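
One alternative I am considering, as an untested sketch, avoids bags in the
input file entirely: keep one required page per line, count the pages each
aim requires, count the distinct required pages each user has matched, and
keep the users for whom both counts are equal. The file name 'aim_all.id'
and all relation names below are hypothetical, and the sketch assumes that
every line of an aim in that file is mandatory (AND), so the "or" aims of
the current aim.id would have to stay in a separate file.

```pig
register 'piggybank.jar';

-- Hypothetical file with one required page per line, same layout as aim.id.
aim_all = LOAD 'aim_all.id' USING PigStorage('@')
    AS (aim_site_id:int, aim_id:int, aim_url:chararray);

site = LOAD 'site.log' USING PigStorage(',')
    AS (user_id:chararray, site_id:int, url:chararray);

-- Number of pages each aim requires.
aims_grouped = GROUP aim_all BY (aim_site_id, aim_id);
aim_sizes = FOREACH aims_grouped GENERATE
    FLATTEN(group) AS (aim_site_id, aim_id),
    COUNT(aim_all) AS required;

-- Match every visited URL against every required page of the same site.
site_all_aims = JOIN site BY site_id, aim_all BY aim_site_id;
site_match = FOREACH site_all_aims GENERATE
    site::user_id AS user_id,
    site::site_id AS site_id,
    aim_all::aim_id AS aim_id,
    aim_all::aim_url AS aim_url,
    org.apache.pig.piggybank.evaluation.string.INDEXOF(site::url, aim_all::aim_url) AS match;
hits = FILTER site_match BY (match IS NOT NULL) AND (match != -1);

-- Keep each matched page only once per user and aim, then count them.
hit_urls = FOREACH hits GENERATE user_id, site_id, aim_id, aim_url;
uniq_hits = DISTINCT hit_urls;
by_user_aim = GROUP uniq_hits BY (user_id, site_id, aim_id);
visited = FOREACH by_user_aim GENERATE
    FLATTEN(group) AS (user_id, site_id, aim_id),
    COUNT(uniq_hits) AS seen;

-- An aim is achieved when every required page has been seen.
joined = JOIN visited BY (site_id, aim_id), aim_sizes BY (aim_site_id, aim_id);
achieved = FILTER joined BY visited::seen == aim_sizes::required;
result = FOREACH achieved GENERATE
    visited::user_id, visited::site_id, visited::aim_id;

DUMP result;
```

Since the match is a substring test, a required page such as "browse" would
also match a URL containing "browse_something_else", so the page names would
need to be chosen with that in mind.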

If any of you has an idea, please mail me.

Thanks

Vincent HERVIEUX
