I believe our objective in this test is to find the best DefaultSimilarity for Lucene. I'd like to extend it to also include finding the best approach to MultiFieldQueryParser. We can keep the two tests separate, but I'd like to get double-duty out of the core effort to set up a test and evaluation environment and process. More detailed changes to Lucene should probably be excluded from this particular test.
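As a rough sketch of the shared evaluation step, here is one way the scoring could work so that the same harness serves both tests. It assumes each entry (a Similarity, or a parser/query combination) has already produced a ranked list of document ids per query, and that relevance judgments exist for each corpus. Nothing here is Lucene API; the function names and data layout are hypothetical, and precision/recall at a fixed cutoff is just one possible choice of measure.

```python
from statistics import mean


def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k results of a single query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def evaluate_corpus(runs, judgments, k):
    """Mean precision/recall at k over all judged queries of one corpus.

    runs:      {query_id: [doc_id, ...]}  ranked output of one entry (hypothetical layout)
    judgments: {query_id: {doc_id, ...}}  relevant documents per query
    """
    scores = [
        precision_recall_at_k(runs.get(query_id, []), relevant, k)
        for query_id, relevant in judgments.items()
    ]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


def evaluate_across_corpora(per_corpus, k=10):
    """Macro-average over corpora, so each corpus counts equally."""
    results = [evaluate_corpus(runs, judgments, k) for runs, judgments in per_corpus]
    return mean(p for p, _ in results), mean(r for _, r in results)
```

The scoring only sees ranked ids, so it does not care whether a ranking came from a single-field query or a multi-field one. Running the same query set through two entries and comparing the two averages would give the kind of side-by-side comparison discussed in this thread.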
I'm planning to "enter" the Similarity I'm using and the DistributingMultiFieldQueryParser/MaxDisjunctionQuery that I've already posted into Bugzilla (http://issues.apache.org/bugzilla/show_bug.cgi?id=32674). I'm not viewing this as a "competition" in the sense that my objective is not to win. I'm planning on doing little or no specific tuning for the corpus, both because of the problem Joaquin cites and because I don't have the time.

From the standpoint of finding the best defaults to ship with Lucene, I agree that testing against multiple corpuses would be desirable.

Chuck

> -----Original Message-----
> From: Joaquin Delgado [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 20, 2004 12:37 PM
> To: Lucene Developers List
> Subject: RE: DefaultSimilarity 2.0?
>
> I understand that not all the vector-space similarity calculation is
> contained within the similarity class (where only factors and their
> values are defined). Will the contestants be allowed to modify any
> relevant classes/methods to improve the relevance quality?
>
> By experience, using only one collection of TREC or other benchmark text
> corpus induces tailoring the algorithms to the corpus. To be fair we
> should run the benchmarks against multiple collections and average
> recall/precision.
>
> -- Joaquin Delgado
>
> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 20, 2004 2:25 PM
> To: Lucene Developers List
> Subject: RE: DefaultSimilarity 2.0?
>
> I agree it makes sense to isolate variables for analysis and comparison.
> It also would seem that we should get as much benefit out of this
> exercise as possible. So, how about multi-field docs with multiple
> query test sets? One test set (or more) could have only single-field
> queries. A simple way to do this might be to have three fields on the
> documents: title, body, and all (= title+body). We could have just one
> set of queries that were run twice with a different parser (parsing into
> "all", or parsing into "title" and "body"). That would provide another
> interesting comparison -- a determination of whether or not
> field-specific boosting is a benefit.
>
> Chuck
>
> > -----Original Message-----
> > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > Sent: Monday, December 20, 2004 9:34 AM
> > To: Lucene Developers List
> > Subject: Re: DefaultSimilarity 2.0?
> >
> > Chuck Williams wrote:
> > > Finally, I'd suggest picking content that has multiple fields and allow
> > > the individual implementations to decide how to search these fields --
> > > just title and body would be enough. I would like to use my
> > > MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
> > > the default MultiFieldQueryParser, assuming somebody uses that in this
> > > test).
> >
> > I think that would be a good contest too, but I'd rather first just
> > focus on the ranking of single-field queries. There are a number of
> > issues that come up with multi-field queries that I'd rather postpone in
> > order to reduce the number of variables we test at one time.
> >
> > Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]