I think this is a great idea and would be happy to play the game. Regarding the collection, TREC offers some benefit if somebody is going to do formal recall and precision computations; otherwise it doesn't matter much. The best Similarity for any collection is likely to be specific to that collection, so if the point here is to pick the best DefaultSimilarity, the collection should be as representative of Lucene users' content as possible (I know this is probably impossible to achieve).
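For anyone who does want the formal numbers, here is a minimal sketch of set-based precision and recall for a single query. The class name PrecisionRecall and the use of plain string document ids are assumptions made for illustration; nothing here comes from TREC tooling or from Lucene itself.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Lucene): set-based precision and recall for one query. */
final class PrecisionRecall {
    private PrecisionRecall() {}

    /** Number of distinct retrieved documents that are also judged relevant. */
    static int hits(List<String> retrieved, Set<String> relevant) {
        Set<String> found = new HashSet<>(retrieved);
        found.retainAll(relevant);
        return found.size();
    }

    /** Fraction of the distinct retrieved documents that are relevant; 0 if nothing was retrieved. */
    static double precision(List<String> retrieved, Set<String> relevant) {
        int distinct = new HashSet<>(retrieved).size();
        if (distinct == 0) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / distinct;
    }

    /** Fraction of the relevant documents that were retrieved; 0 if no document is judged relevant. */
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / relevant.size();
    }
}
```

This ignores rank order entirely; a real evaluation would probably also want a rank-aware measure, but that needs relevance judgments for the chosen collection.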
One possible danger in these kinds of bake-offs is that people who know the content will likely craft specific queries that do not reflect what real users would ask. It would be good to at least have a standard set of queries that is tested against each implementation. Perhaps each person could contribute a set of test queries in addition to their Similarity, and the combined query set could be tested against each implementation.
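A rough sketch of how the pooled query set could be run is below. Everything in it is hypothetical: the BakeOff class, the idea of modelling each contributed implementation as a function from a query string to a ranked list of document ids, and the contributor names in the usage note. It only shows the bookkeeping, not any Lucene search code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical harness: pool everyone's queries, then run the pool against every implementation. */
final class BakeOff {
    private BakeOff() {}

    /** Merge each contributor's queries into one de-duplicated list, keeping first-seen order. */
    static List<String> pool(Map<String, List<String>> queriesByContributor) {
        Set<String> pooled = new LinkedHashSet<>();
        for (List<String> queries : queriesByContributor.values()) {
            pooled.addAll(queries);
        }
        return new ArrayList<>(pooled);
    }

    /** Returns implementation name -> (query -> ranked document ids) for every pooled query. */
    static Map<String, Map<String, List<String>>> runAll(
            Map<String, Function<String, List<String>>> implementations,
            List<String> queries) {
        Map<String, Map<String, List<String>>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> impl : implementations.entrySet()) {
            Map<String, List<String>> perQuery = new LinkedHashMap<>();
            for (String query : queries) {
                perQuery.put(query, impl.getValue().apply(query));
            }
            results.put(impl.getKey(), perQuery);
        }
        return results;
    }
}
```

With contributors "alice" and "bob" each submitting two queries, one of them shared, pool returns three queries and runAll produces one result list per implementation per query, so no implementation is judged only on its author's own queries.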
Finally, I'd suggest picking content that has multiple fields and allowing the individual implementations to decide how to search these fields -- just title and body would be enough. I would like to use my MaxDisjunctionQuery and see how it compares to other approaches (e.g., the default MultiFieldQueryParser, assuming somebody uses that in this test).
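To illustrate why the multi-field choice matters, here is a toy comparison of two ways to combine per-field scores for one document. It assumes, going only by the name, that a max-disjunction approach keeps the best per-field score rather than adding the fields together; it is not the actual MaxDisjunctionQuery code, and the FieldScoreCombiner class and the score values are made up.

```java
import java.util.Map;

/** Hypothetical illustration: two ways to combine per-field scores for a single document. */
final class FieldScoreCombiner {
    private FieldScoreCombiner() {}

    /** Adds the per-field scores, so a term matching in both title and body is counted twice. */
    static double sum(Map<String, Double> fieldScores) {
        double total = 0.0;
        for (double score : fieldScores.values()) {
            total += score;
        }
        return total;
    }

    /** Keeps only the best per-field score; returns 0 when no field matched. */
    static double max(Map<String, Double> fieldScores) {
        double best = 0.0;
        for (double score : fieldScores.values()) {
            best = Math.max(best, score);
        }
        return best;
    }
}
```

A document scoring 0.8 on title and 0.5 on body gets 1.3 under the sum but 0.8 under the max, so the two strategies can rank the same documents differently even with an identical Similarity.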
I believe the collection that I'm using in LuceneBenchmark meets most if not all of these requirements - the "20 newsgroups" corpus. Please see the following link for the benchmark code:
http://www.getopt.org/lb/LuceneBenchmark.java
This collection has the benefit that it's fairly easy to judge the relative relevance scores, because the nature and structure of the corpus are well understood.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com