Copied from http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/
For a while now, I have been trying to get my hands on TREC data for the Lucene project. For those who aren’t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute, and relevance judgments against which to check your answers, so you can see how well an engine performs. While it isn’t the be-all, end-all for relevance, it is a pretty good sanity check on how you are doing. For instance, many search engines do OK on it out of the box, but once you tune them, they can do much better. Of course, you risk overtuning to TREC as well.
In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization. It isn’t expensive, but it is a barrier nonetheless, especially for an open source project. Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn’t sit right with me. Don’t get me wrong: I’m a big fan of TREC, having participated in the past, and it provides a valuable service to the proprietary/academic IR community.
So, what does this have to do with Lucene? When I say I am trying to get my hands on TREC data, I don’t mean just for me; I literally mean obtaining TREC data for Lucene. That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on, so as to spur innovation in Lucene’s scoring algorithms, etc. Now, I know the copyright owners will never allow this, as I have asked. So, my next thought was: let’s just get it for internal use by committers at Apache. So, I went back to TREC and we have an agreement to do this, more or less. The problem, however, is that they say we can only use the data on ASF (Apache) machines. Not a big deal, right? Kind of. The ASF doesn’t really have the hardware to run TREC-style experiments. We pretty much have one Solaris “zone” allotted to us (a “zone” is a guest virtual machine image). Furthermore, the ASF is pretty much an all-volunteer, worldwide distributed organization. We do almost all of our work on our own machines as VOLUNTEERS. Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn’t going to happen.
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
Practically speaking, I don’t think we even need to go as deep as TREC. I think we would find the most use in making judgments on the top 10 or 20 results for any given query.
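To make the shallow-judgment idea concrete, here is a minimal sketch (my own illustration, not Lucene’s benchmark code) of scoring a ranked result list against TREC-style judgments with precision@k. It assumes the standard qrels line format (topic, iteration, docid, relevance); the topic and document names are made up:

```python
def load_qrels(lines):
    """Parse TREC-style qrels lines: 'topic iteration docid relevance'."""
    qrels = {}
    for line in lines:
        topic, _iteration, docid, rel = line.split()
        qrels.setdefault(topic, {})[docid] = int(rel)
    return qrels

def precision_at_k(qrels, topic, ranked_docids, k=10):
    """Fraction of the top-k results judged relevant for this topic.
    Unjudged documents are treated as non-relevant."""
    judged = qrels.get(topic, {})
    top = ranked_docids[:k]
    relevant = sum(1 for d in top if judged.get(d, 0) > 0)
    return relevant / k

# Hypothetical judgments and a hypothetical ranked run for topic "1":
qrels = load_qrels([
    "1 0 doc_a 1",
    "1 0 doc_b 0",
    "1 0 doc_c 1",
])
print(precision_at_k(qrels, "1", ["doc_a", "doc_c", "doc_b"], k=2))  # 1.0
```

With only the top 10 or 20 results judged per query, this is exactly the kind of number a wiki-maintained judgment set could produce cheaply.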
So, what do others think? Am I off my rocker? Are there any volunteers out there? I think we could do this pretty simply through some scripts and the effective use of a wiki. I don’t think our goal is, in the short run, to be scientifically rigorous, but it should be over time. Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take the top 50 queries and judge them, as well as 20 or so random queries, and judge those too. (I wonder if Wikipedia would give us their top 50 queries, or maybe that is already available.) Over time, we can add queries and refine judgments using the web 2.0 mentality of the wisdom of crowds.
FWIW, there is probably some alignment with the Wikia search project.

Cheers,
Grant