Hi Andrzej,

Thanks for bringing up the AOL dataset (I've got a copy of it stashed away). The person I'm helping looked at it, and we thought it didn't have all the data one needs to build custom relevance models. Here is a small sample:

AnonID  Query                  QueryTime            ItemRank  ClickURL
217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
217     ameriprise.com         2006-03-01 14:06:23  1         http://www.ameriprise.com
217     susheme                2006-03-02 12:31:08
217     united.com             2006-03-03 14:54:13
217     mizuno.com             2006-03-07 22:41:17  1         http://www.mizuno.com
217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:27
217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:35
217     buddylis               2006-03-16 15:23:33
217     bestasiancompany.com   2006-03-20 15:15:43  1         http://www.bestasiancompany.com
217     lottery                2006-03-27 14:10:38  1         http://www.calottery.com
217     lottery                2006-03-27 16:34:59  1         http://www.calottery.com
217     ask.com                2006-03-31 14:31:10  1         http://www.ask.com

For instance, in order to build custom relevance models, wouldn't we need to have the actual corpus/index associated with this data in order to get the base relevance scores first? Or could one just look at clicks on results that ranked low (a numerically high ItemRank, meaning they were not close to the top of the search results) and apply some algorithm that essentially produces a boost score that stands on its own and is applied on top of the relevance score at search time?

Would it make sense to have a global boost score for each document, or would that need to be query-specific and thus applied at query time and not at index time?

If you have an idea how one could/should go about using just the above to build custom relevance models for Lucene, I'm all eyeballs.

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Andrzej Bialecki <[email protected]>
> To: [email protected]
> Sent: Wed, March 2, 2011 5:07:55 AM
> Subject: Re: Query & click logs for custom Lucene relevance models
>
> On 3/2/11 3:39 AM, Otis Gospodnetic wrote:
> > Hello,
> >
> > I'm helping out a student interested in using query and click logs to build
> > custom relevance models for Lucene. Step #1 is finding a good dataset that
> > contains the needed data. I've looked around, found a few things, but nothing
> > that looks very good.
> >
> > I was wondering if anyone has any dataset suggestions?
>
> The (in)famous AOL dataset comes to my mind, and it's very good, maybe even
> too good :) AOL officially pulled it back, but it's still available and IMHO
> legitimate to use - it was a blunder all right but it carried a suitable
> license and things can't be un-published ...
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
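
To make the "boost score that stands on its own" question a bit more concrete, here is a minimal sketch in Python of one way the log alone might be turned into query-specific boosts. Everything in it is an assumption for illustration: the function name click_boosts, the tab-separated layout (the sample above has its whitespace collapsed), and the log2(1 + ItemRank) weighting, which simply gives more credit to clicks on results that were shown further down the list. It is not a Lucene API and not an established relevance model.

```python
import math
from collections import defaultdict

# Assumption: fields are tab-separated, and rows without a click
# carry only AnonID, Query and QueryTime.
SAMPLE = "\n".join([
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL",
    "217\tlottery\t2006-03-01 11:58:51\t1\thttp://www.calottery.com",
    "217\tsusheme\t2006-03-02 12:31:08",
    "217\tmizuno.com\t2006-03-07 22:41:17\t1\thttp://www.mizuno.com",
    "217\tlottery\t2006-03-27 14:10:38\t1\thttp://www.calottery.com",
])


def click_boosts(log_text):
    """Return {(query, url): boost} from click rows (hypothetical heuristic)."""
    boosts = defaultdict(float)
    for line in log_text.splitlines()[1:]:  # skip the header row
        fields = line.split("\t")
        # Skip queries that produced no click (no ItemRank / ClickURL).
        if len(fields) < 5 or not fields[3].strip():
            continue
        query, rank, url = fields[1], int(fields[3]), fields[4]
        # A click at a deeper rank contributes more than a click at rank 1.
        boosts[(query, url)] += math.log2(1 + rank)
    return dict(boosts)


if __name__ == "__main__":
    for (query, url), boost in sorted(click_boosts(SAMPLE).items()):
        print(query, url, boost)
```

Because the keys are (query, URL) pairs, boosts of this kind would have to be looked up at query time. Summing them per URL across all queries would give the global, index-time variant, at the cost of losing the query context.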
