I haven't run it yet to take a look at the collections, but the code looks fine. subject and body will make good content fields to query against. I think we need a couple of additional things, though:

1. An interactive UI for trying queries. This should be a webapp so that people can use it. The batch query UI should be kept for running a standard test set, and there needs to be a way to see the results of a batch run (I didn't look carefully at how this is done now). The emphasis in this test is on understanding the ordering and scoring of results, not on performance, although basic timing should be included in case any of the implementations differ on that dimension.

2. Instead of using QueryParser against body, it should use MultiFieldQueryParser against subject and body (or maybe against subject, from, and body). Apps may change this; I will change it to use my approach for multiple fields. A sketch of the parser change is below.
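To make (2) concrete, here is a minimal sketch of the parser change, assuming the field names subject and body from the benchmark code and an index directory called "index" (both assumptions; untested):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class MultiFieldSearch {
        public static void main(String[] args) throws Exception {
            // Parse against subject and body together instead of body alone.
            // ("from" could be added as a third field if it proves useful.)
            String[] fields = { "subject", "body" };
            Query query = MultiFieldQueryParser.parse(
                    args[0], fields, new StandardAnalyzer());

            // "index" is a placeholder for wherever the benchmark writes its index.
            IndexSearcher searcher = new IndexSearcher("index");
            Hits hits = searcher.search(query);

            // Print score and subject for the top ten hits, since the point
            // of the test is the ordering and scoring of results.
            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                System.out.println(hits.score(i) + "\t" + hits.doc(i).get("subject"));
            }
            searcher.close();
        }
    }

This is the stock MultiFieldQueryParser variant; my MaxDisjunctionQuery approach would replace the parse call but leave the rest unchanged.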
Chuck

> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 17, 2004 4:06 PM
> To: Lucene Developers List
> Subject: Re: DefaultSimilarity 2.0?
>
> Chuck Williams wrote:
> > I think this is a great idea and would be happy to play the game.
> > Re. the collection, there is some benefit to TREC if somebody is
> > going to do formal recall and precision computations; otherwise it
> > doesn't matter much. The best Similarity for any collection is
> > likely to be specific to the collection, so if the point here is to
> > pick the best DefaultSimilarity, the collection should be as
> > representative of Lucene users' content as possible (I know this is
> > probably impossible to achieve).
> >
> > One possible danger in these kinds of bake-offs is that people who
> > know the content will likely craft specific queries that are not
> > reflective of real users. It would be good to at least have a
> > standard set of queries that was tested against each implementation.
> > Perhaps each person could contribute a set of test queries in
> > addition to their Similarity, and the combined query set could be
> > tested against each.
> >
> > Finally, I'd suggest picking content that has multiple fields and
> > allowing the individual implementations to decide how to search
> > these fields -- just title and body would be enough. I would like to
> > use my MaxDisjunctionQuery and see how it compares to other
> > approaches (e.g., the default MultiFieldQueryParser, assuming
> > somebody uses that in this test).
>
> I believe the collection that I'm using in LuceneBenchmark meets most
> if not all of these requirements - the "20 newsgroups" corpus. Please
> see the following link for the benchmark code:
>
> http://www.getopt.org/lb/LuceneBenchmark.java
>
> This collection has the benefit that it's relatively easy to judge the
> relative relevance scores, because the nature and structure of the
> corpus is well understood.
>
> --
> Best regards,
> Andrzej Bialecki
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
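[Editorial note: as a reference point for the bake-off described above, here is a minimal sketch of what a contributed Similarity entry might look like. The class name and the particular lengthNorm/tf tweaks are illustrative placeholders, not a tuned proposal from the thread.]

    import org.apache.lucene.search.DefaultSimilarity;

    // Illustrative only: one shape a contributed Similarity could take.
    // The formulas below are placeholders, not a tuned proposal.
    public class BakeoffSimilarity extends DefaultSimilarity {

        // Penalize long messages less aggressively than the
        // default 1/sqrt(numTerms).
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(Math.sqrt(numTerms)));
        }

        // A flatter term-frequency curve than the default sqrt(freq).
        public float tf(float freq) {
            return (float) Math.log(1.0 + freq);
        }
    }

An entry like this has to be installed on both sides -- setSimilarity on the IndexWriter at index time (lengthNorm is folded into the stored norms) and on the Searcher at query time -- so that batch runs compare like with like.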