I looked at the AOL search query logs and am thinking of creating a search
query recommendation demo using P-FPGrowth. I would like some suggestions from
the Mahout-ers regarding the kind of preprocessing that needs to be done.

Take a look at the data snippet below:

8805721 jack johnson 2006-05-01 19:53:02
8805721 jack johnson 2006-05-01 19:54:02
8805721 pbs 2006-05-02 18:50:46 2 http://pbskids.org
8805721 mazon 2006-05-06 16:57:50
8805721 amason 2006-05-06 17:32:23
8805721 amazon 2006-05-06 17:32:42 3 http://www.eduweb.com
8805721 amazon 2006-05-06 17:35:13
8805721 amazon 2006-05-06 17:35:48
8805721 amazon 2006-05-06 17:36:18
8805721 amazon 2006-05-06 17:36:59 16 http://www.amazon.co.uk
8805721 iatse benefits 2006-05-07 19:50:50 3 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:57:15 1 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:59:46
8805721 iatse benefits prudential 2006-05-07 20:00:12
8805721 iatse benefits prudential 2006-05-07 20:00:38
8805817 motorcycle safety course 2006-03-05 22:24:56
8805817 www.pamsp.com 2006-03-05 22:27:56
8805817 ceramic tiles 2006-03-05 22:46:50
8805817 floormall.com 2006-03-05 22:49:26
8805817 ceramic tiles 2006-03-05 22:50:10
8805817 wwwirisceramica.com 2006-03-05 22:51:33
8805817 redhead 2006-03-08 17:16:40
8805817 colorado canoe 2006-03-20 14:25:06
8805817 www.best-price.com boating&sailing 2006-03-20 14:27:04

The data is in the format: anonymous user ID, search query, date and time, the
rank of the URL clicked (if any), and the hostname of the URL clicked.
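
A minimal sketch of parsing one record in this format, assuming the fields are
whitespace-separated exactly as in the snippet above (the raw AOL files may use
a different delimiter). The name parse_line and the regular expression are
illustrative, not part of Mahout. Lines that do not match the pattern, such as
a record broken across two lines, come back as None.

```python
import re
from datetime import datetime

# Assumption: user id, free-text query, "YYYY-MM-DD HH:MM:SS" timestamp,
# then an optional click rank and clicked URL, all separated by whitespace.
LINE_RE = re.compile(
    r"^(?P<user>\d+)\s+(?P<query>.*?)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"(?:\s+(?P<rank>\d+)\s+(?P<url>\S+))?\s*$"
)

def parse_line(line):
    """Return (user_id, query, timestamp, rank, url), or None if malformed."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rank = int(m.group("rank")) if m.group("rank") else None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    return (m.group("user"), m.group("query"), ts, rank, m.group("url"))
```

For records without a click, rank and url are both None.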


What I am thinking is: for a given user, take a 5-minute window in time,
group all the queries in it (keeping only one copy of any repeated query), and
call that a transaction for PFPGrowth.
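
A rough sketch of that grouping, under a few assumptions: records arrive as
(user_id, query, timestamp) tuples already sorted by user and then by time, and
the window is fixed and anchored at the first query of each transaction (a
gap-based session window would be an equally valid reading). The name
to_transactions is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: a fixed 5-minute window that starts at the first query
# of each transaction.
WINDOW = timedelta(minutes=5)

def to_transactions(records, window=WINDOW):
    """Group (user_id, query, timestamp) records into per-user transactions.

    Records must be sorted by user id, then by timestamp.
    Returns a list of (user_id, [unique queries in first-seen order]).
    """
    transactions = []
    current_user, window_start, current = None, None, []
    for user_id, query, ts in records:
        # Start a new transaction on a new user or once the window has elapsed.
        if user_id != current_user or ts - window_start > window:
            if current:
                transactions.append((current_user, current))
            current_user, window_start, current = user_id, ts, []
        # Keep only one copy of a repeated query.
        if query not in current:
            current.append(query)
    if current:
        transactions.append((current_user, current))
    return transactions
```

On the snippet above, the queries "amason" and "amazon" issued by user 8805721
between 17:32:23 and 17:36:59 would end up in one transaction, while "mazon" at
16:57:50 would form a transaction of its own.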

Once PFPGrowth runs, it should return the search queries that frequently
co-occur with a given query (at least I hope so :D).


Does this make sense? If not, could someone point me towards another open
dataset, or suggest a different formulation over the AOL data?


Robin
