Good afternoon,

I am using Pig on server logs to compute statistics on visited pages.
For now I am able to detect two kinds of matches:
- a user has visited a given page matching a given aim.
- a user has visited any one of the pages belonging to a given aim.

*aim.id file:*
1...@1@add_to_kart --> aim 1 can be add_to_kart
1...@1@browse --> or aim 1 can be browse
1...@2@paid --> aim 2 is only paid

*site.log file:*
user1,1,www.site.com/browse
user1,1,www.site.com/browse
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/paid
user2,1,www.site.com/browse
user3,1,www.site.com/browse

*Pig script:*
register 'piggybank.jar';

-- load aim id database:
aim_ids = LOAD 'aim.id' USING PigStorage('@') AS (aim_site_id : int, aim_id : int, aim_url : chararray);
DUMP aim_ids;

-- load site log:
site = LOAD 'site.log' USING PigStorage(',') AS (user_id : chararray, site_id : int, url : chararray);

-- pair every log line with every aim defined for its site:
site_all_aims = JOIN site BY site_id, aim_ids BY aim_site_id;

-- position of the aim URL inside the visited URL (-1 if absent):
site_match = FOREACH site_all_aims GENERATE user_id, site_id, aim_id, org.apache.pig.piggybank.evaluation.string.INDEXOF(url, aim_url) AS match;

-- keep only the log lines whose URL contains the aim URL:
site_aims = FILTER site_match BY (match != -1) AND (match IS NOT null);
DUMP site_aims;

*results:*
(user1,1,1,13) --> user 1 achieved aim 1
(user1,1,1,13) --> user 1 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,2,13) --> user 2 achieved aim 2
(user2,1,1,13) --> user 2 achieved aim 1
(user3,1,1,13) --> user 3 achieved aim 1

Now I would like to check that a user has visited several pages to achieve one aim. For example, to achieve aim 3, a user needs to visit "browse" AND "add_to_kart" AND "paid".

My idea was to load the tuples for each aim:
1...@3@{(browse),(add_to_kart),(paid)}
and to write a UDF that compares the aim's URL tuples with the bag of URLs the user visited on the site. However, I am not able to load tuples with an undefined number of elements.
For instance, aims might be:
1...@3@{(browse),(add_to_kart),(paid)}
1...@4@{(browse_something_else),(add_to_kart_something_else),(paid_something_else),(another_page)}

So I am stuck on this problem right now, and I am still looking for another way to write this script and the aim.id file. If any of you has an idea, please mail me.

Thanks,
Vincent HERVIEUX
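The comparison such a UDF would have to perform can be sketched outside Pig. The block below is a minimal illustration in plain Python, not a Pig UDF and not tested against Pig. The names `achieved_aims`, `AIMS` and `VISITS` are hypothetical, the aim definitions are typed in by hand rather than parsed from aim.id, and substring matching stands in for the INDEXOF test used in the script.

```python
# Hypothetical aim definitions: aim id -> URL fragments that must ALL be visited.
# The lists may have any length, which is the case the post is stuck on.
AIMS = {
    3: ["browse", "add_to_kart", "paid"],
    4: ["browse_something_else", "add_to_kart_something_else",
        "paid_something_else", "another_page"],
}

# URLs visited per user on site 1, copied from the site.log sample.
VISITS = {
    "user1": ["www.site.com/browse", "www.site.com/browse"],
    "user2": ["www.site.com/add_to_kart", "www.site.com/add_to_kart",
              "www.site.com/paid", "www.site.com/browse"],
    "user3": ["www.site.com/browse"],
}


def achieved_aims(visited_urls, aims):
    """Return the sorted ids of the aims whose fragments all occur in the visited URLs."""
    done = []
    for aim_id, fragments in aims.items():
        # An aim is achieved when every required fragment is a substring
        # of at least one visited URL (same idea as INDEXOF != -1).
        if all(any(frag in url for url in visited_urls) for frag in fragments):
            done.append(aim_id)
    return sorted(done)


for user in sorted(VISITS):
    print(user, achieved_aims(VISITS[user], AIMS))
# prints:
# user1 []
# user2 [3]
# user3 []
```

With the sample log, only user2 has visited "browse", "add_to_kart" and "paid", so only user2 achieves aim 3. Because the check is a plain "all required fragments seen" test, it does not care how many fragments an aim lists.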