I looked at the AOL search query logs and am thinking of creating a search query recommendation demo using PFPGrowth. I would like some suggestions from the Mahout-ers regarding the kind of preprocessing that needs to be done.
Take a look at the data snippet below:

8805721 jack johnson 2006-05-01 19:53:02
8805721 jack johnson 2006-05-01 19:54:02
8805721 pbs 2006-05-02 18:50:46 2 http://pbskids.org
8805721 mazon 2006-05-06 16:57:50
8805721 amason 2006-05-06 17:32:23
8805721 amazon 2006-05-06 17:32:42 3 http://www.eduweb.com
8805721 amazon 2006-05-06 17:35:13
8805721 amazon 2006-05-06 17:35:48
8805721 amazon 2006-05-06 17:36:18
8805721 amazon 2006-05-06 17:36:59 16 http://www.amazon.co.uk
8805721 iatse benefits 2006-05-07 19:50:50 3 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:57:15 1 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:59:46
8805721 iatse benefits prudential 2006-05-07 20:00:12
8805721 iatse benefits prudential 2006-05-07 20:00:38
8805817 motorcycle safety course 2006-03-05 22:24:56
8805817 www.pamsp.com 2006-03-05 22:27:56
8805817 ceramic tiles 2006-03-05 22:46:50
8805817 floormall.com 2006-03-05 22:49:26
8805817 ceramic tiles 2006-03-05 22:50:10
8805817 wwwirisceramica.com 2006-03-05 22:51:33
8805817 redhead 2006-03-08 17:16:40
8805817 colorado canoe 2006-03-20 14:25:06
8805817 www.best-price.com boating&sailing 2006-03-20 14:27:04

The data is in the format: anonymous user ID, search query, date and time, rank of the URL clicked (if any), hostname of the URL clicked (if any).

What I am thinking is this: for a given user, take a 5-minute window in time, group all the queries that fall in it (keeping only one copy of any repeated query), and call that a transaction for PFPGrowth. Once PFPGrowth runs, it should return the search queries that frequently co-occur with a given query (at least I hope so :D).

Does this make sense? Alternatively, I would welcome a pointer to any other open dataset, or a different formulation over the AOL data.

Robin
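P.S. To make the windowing idea concrete, here is a rough Python sketch of the preprocessing I have in mind. It is only an illustration under a few assumptions: the log is tab-separated, it is already sorted by user and then by time, and a "window" means the 5 minutes following the first query of a group (a gap-based session split would be an equally valid reading). The names parse_line and transactions are placeholders of mine, and nothing here is Mahout code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed window length, taken from the idea above


def parse_line(line):
    """Split one log line into (user_id, query, timestamp).

    Assumes tab-separated fields; the click rank and URL, if present,
    are ignored because only the query text goes into a transaction.
    """
    fields = line.rstrip("\n").split("\t")
    user_id, query, stamp = fields[0], fields[1], fields[2]
    return user_id, query.strip(), datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


def transactions(records, window=WINDOW):
    """Group (user_id, query, timestamp) records into transactions.

    Records must be sorted by user and then by time. A new transaction
    starts when the user changes or when a query arrives more than
    `window` after the first query of the current transaction.
    Repeated queries inside a transaction are kept only once.
    """
    result = []
    current_user, window_start, current = None, None, []
    for user_id, query, when in records:
        if user_id != current_user or when - window_start > window:
            if current:
                result.append(current)
            current_user, window_start, current = user_id, when, []
        if query not in current:
            current.append(query)
    if current:
        result.append(current)
    return result


# A few lines from the snippet above, re-typed with tab separators.
SAMPLE = [
    "8805721\tmazon\t2006-05-06 16:57:50",
    "8805721\tamason\t2006-05-06 17:32:23",
    "8805721\tamazon\t2006-05-06 17:32:42\t3\thttp://www.eduweb.com",
    "8805721\tamazon\t2006-05-06 17:35:13",
    "8805817\tceramic tiles\t2006-03-05 22:46:50",
    "8805817\tfloormall.com\t2006-03-05 22:49:26",
    "8805817\tceramic tiles\t2006-03-05 22:50:10",
    "8805817\twwwirisceramica.com\t2006-03-05 22:51:33",
]

if __name__ == "__main__":
    for items in transactions(parse_line(line) for line in SAMPLE):
        print(items)
```

On these sample lines the sketch yields three transactions: one with only "mazon", one with "amason" and "amazon", and one with the three tile-related queries. Single-item transactions carry no co-occurrence information, so they could probably be dropped before the transactions are handed to PFPGrowth.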
