Re: DefaultSimilarity 2.0?

Andrzej Bialecki Fri, 17 Dec 2004 16:06:06 -0800

Chuck Williams wrote:

I think this is a great idea and would be happy to play the game.  Re.
the collection: TREC has some benefit if somebody is going to do formal
recall and precision computations; otherwise it doesn't matter much.
The best Similarity for any collection is likely to be specific to that
collection, so if the point here is to pick the best DefaultSimilarity,
the collection should be as representative of Lucene users' content as
possible (I know this is probably impossible to achieve).

One possible danger in these kinds of bake-offs is that people who know
the content will likely craft specific queries that are not reflective
of real users' queries.  It would be good to at least have a standard
set of queries that is tested against each implementation.  Perhaps
each person could contribute a set of test queries in addition to their
Similarity, and the combined query set could be tested against each
implementation.
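
As a rough illustration of the pooled-query idea, here is a minimal Java sketch. It is not taken from Lucene or LuceneBenchmark: the class BakeOff, its method names, and the modelling of each entry as a function from a query string to ranked document ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical harness; none of these names exist in Lucene.
class BakeOff {

    // Merge every contributor's queries into one set, dropping exact
    // duplicates and keeping first-seen order so runs are repeatable.
    static Set<String> poolQueries(List<List<String>> contributions) {
        Set<String> pooled = new LinkedHashSet<>();
        for (List<String> queries : contributions) {
            pooled.addAll(queries);
        }
        return pooled;
    }

    // Run the whole pooled set against every entry. An entry is modelled
    // (as an assumption) as a function from query text to ranked doc ids.
    static Map<String, Map<String, List<String>>> runAll(
            Map<String, Function<String, List<String>>> entries,
            Set<String> pooled) {
        Map<String, Map<String, List<String>>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> e : entries.entrySet()) {
            Map<String, List<String>> perQuery = new LinkedHashMap<>();
            for (String query : pooled) {
                perQuery.put(query, new ArrayList<>(e.getValue().apply(query)));
            }
            results.put(e.getKey(), perQuery);
        }
        return results;
    }
}
```

Because every entry sees the same pooled set, no implementation is scored only on queries written by its own author.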

Finally, I'd suggest picking content that has multiple fields and
allowing the individual implementations to decide how to search these
fields -- just title and body would be enough.  I would like to use my
MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
the default MultiFieldQueryParser, assuming somebody uses that in this
test).
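
To make the multi-field comparison concrete, here is a small Java sketch of two ways to combine per-field scores for a document. It assumes MaxDisjunctionQuery keeps the best single-field score while a plain multi-field search adds the field scores together; the message above does not spell out either behaviour, and the class FieldScoreCombiner and the sample scores are hypothetical.

```java
// Hypothetical helper; not part of Lucene. It only contrasts two ways of
// turning per-field scores (e.g. title, body) into one document score.
class FieldScoreCombiner {

    // Additive combination: every matching field contributes.
    static double sum(double[] fieldScores) {
        double total = 0.0;
        for (double s : fieldScores) {
            total += s;
        }
        return total;
    }

    // Max combination (assumed MaxDisjunctionQuery-style): only the
    // best-matching field counts. Returns 0.0 when no field scores exist.
    static double max(double[] fieldScores) {
        double best = 0.0;
        for (double s : fieldScores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] strongTitleOnly = {0.9, 0.0};  // {title, body}, made-up scores
        double[] weakInBoth = {0.5, 0.5};
        System.out.println("sum: " + sum(strongTitleOnly) + " vs " + sum(weakInBoth));
        System.out.println("max: " + max(strongTitleOnly) + " vs " + max(weakInBoth));
    }
}
```

With these made-up scores the two schemes rank the documents in opposite orders, which is the kind of difference a title-and-body test collection would expose.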

I believe the collection that I'm using in LuceneBenchmark, the "20 newsgroups" corpus, meets most if not all of these requirements. Please see the following link for the benchmark code:

        http://www.getopt.org/lb/LuceneBenchmark.java

This collection has the benefit that it is fairly easy to judge relative relevance scores, because the nature and structure of the corpus are well understood.
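
One way to turn that structure into numbers is to treat each article's newsgroup as a stand-in relevance judgment. This is an assumption for illustration only, not how LuceneBenchmark necessarily works. The Java sketch below computes precision and recall at a cutoff k under that assumption; the class LabelJudgments and its methods are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: a result counts as relevant when its newsgroup
// label equals the newsgroup the query was aimed at.
class LabelJudgments {

    // Number of relevant documents among the first k ranked results.
    private static int hitsInTopK(List<String> ranked, Map<String, String> labelOf,
                                  String targetGroup, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (targetGroup.equals(labelOf.get(ranked.get(i)))) {
                hits++;
            }
        }
        return hits;
    }

    // Precision at k: relevant results in the top k, divided by k.
    static double precisionAtK(List<String> ranked, Map<String, String> labelOf,
                               String targetGroup, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        return (double) hitsInTopK(ranked, labelOf, targetGroup, k) / k;
    }

    // Recall at k: relevant results in the top k, divided by the number of
    // documents in the whole collection carrying the target label.
    static double recallAtK(List<String> ranked, Map<String, String> labelOf,
                            String targetGroup, int k) {
        int totalRelevant = 0;
        for (String label : labelOf.values()) {
            if (targetGroup.equals(label)) {
                totalRelevant++;
            }
        }
        if (totalRelevant == 0) {
            return 0.0;
        }
        return (double) hitsInTopK(ranked, labelOf, targetGroup, k) / totalRelevant;
    }
}
```

Newsgroup membership is only a coarse proxy for relevance, so figures computed this way are better suited to comparing implementations against each other than to absolute claims.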

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

