Copied from http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/
For a while now, I have been trying to get my hands on TREC data for the Lucene project. For those who aren’t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute, and relevance judgments against which to check your answers, so you can see how well an engine performs. While it isn’t the be-all, end-all for relevance, it is a pretty good sanity check on how you are doing. For instance, many search engines do OK on it out of the box, but once you tune them, they can do much better. Of course, you risk overtuning to TREC as well.
In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization. It isn’t expensive, but it is a barrier nonetheless, especially for an open source project. Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn’t sit right with me. Don’t get me wrong: I’m a big fan of TREC, having participated in the past, and it provides a valuable service to the proprietary/academic IR community.
So, what does this have to do with Lucene? When I say I am trying to get my hands on TREC data, I don’t mean just for me; I literally mean obtaining TREC data for Lucene. That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on, so as to spur innovation in Lucene’s scoring algorithms, etc. Now, I know the copyright owners will never allow this, as I have asked. So, my next thought was: let’s just get it for internal use by committers at Apache. So, I went back to TREC and we have an agreement to do this, more or less. The problem, however, is that they say we can only use the data on ASF (Apache) machines. Not a big deal, right? Kind of. The ASF doesn’t really have the hardware to run TREC-style experiments. We pretty much have one Solaris “zone” allotted to us (a “zone” is a guest virtual machine image). Furthermore, the ASF is pretty much an all-volunteer, worldwide distributed organization. We do almost all of our work on our own machines as VOLUNTEERS. Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn’t going to happen.
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
Practically speaking, I don’t think we even need to go as deep as TREC. I think we would find the most use in making judgments on the top 10 or 20 results for any given query.
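To make the shallow-judgment idea concrete, here is a minimal sketch (my own illustration, not Lucene’s benchmark code) of scoring a ranked result list against TREC-style judgments with precision@k. It assumes the standard qrels line format (topic, iteration, docid, relevance); the topic and document names are made up:

```python
def load_qrels(lines):
    """Parse TREC-style qrels lines: 'topic iteration docid relevance'."""
    qrels = {}
    for line in lines:
        topic, _iteration, docid, rel = line.split()
        qrels.setdefault(topic, {})[docid] = int(rel)
    return qrels

def precision_at_k(qrels, topic, ranked_docids, k=10):
    """Fraction of the top-k results judged relevant for this topic.
    Unjudged documents are treated as non-relevant."""
    judged = qrels.get(topic, {})
    top = ranked_docids[:k]
    relevant = sum(1 for d in top if judged.get(d, 0) > 0)
    return relevant / k

# Hypothetical judgments and a hypothetical ranked run for topic "1":
qrels = load_qrels([
    "1 0 doc_a 1",
    "1 0 doc_b 0",
    "1 0 doc_c 1",
])
print(precision_at_k(qrels, "1", ["doc_a", "doc_c", "doc_b"], k=2))  # 1.0
```

With only the top 10 or 20 results judged per query, this is exactly the kind of number a wiki-maintained judgment set could produce cheaply.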
So, what do others think? Am I off my rocker? Are there any volunteers out there? I think we could do this pretty simply through some scripts and the effective use of a wiki. I don’t think our goal is, in the short run, to be scientifically rigorous, but it should be over time. Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take the top 50 queries and judge them, as well as 20 or so random queries, and judge those too. (I wonder if Wikipedia would give us their top 50 queries, or maybe that is already available.) Over time, we can add queries and refine judgments using the web 2.0 mentality of the wisdom of crowds.
FWIW, there is probably some alignment with the Wikia search project.

Cheers,
Grant