I think this is a great idea and would be happy to play the game.
Regarding the collection, there is some benefit to TREC if somebody is
going to do formal recall and precision computations; otherwise it
doesn't matter much.  The best Similarity for any collection is likely
to be specific to the collection, so if the point here is to pick the
best DefaultSimilarity, the collection should be as representative of
Lucene users' content as possible (I know this is probably impossible
to achieve).
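
To be concrete about the kind of tf()/idf() changes I have in mind
(quoted below), a candidate entry could be as small as a
DefaultSimilarity subclass plus one line to install it.  A throwaway
sketch -- the damping formulas here are placeholders, not a proposal:

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

/** Illustrative candidate: damps tf and idf so neither dominates. */
public class CandidateSimilarity extends DefaultSimilarity {

  // DefaultSimilarity uses sqrt(freq); log damps high tf harder.
  public float tf(float freq) {
    return (float) Math.log(1.0 + freq);
  }

  // Shrink idf's dynamic range by taking the square root of the
  // default log(numDocs/(docFreq+1)) + 1 value.
  public float idf(int docFreq, int numDocs) {
    return (float) Math.sqrt(super.idf(docFreq, numDocs));
  }
}

// Install globally (indexing and search must agree):
// Similarity.setDefault(new CandidateSimilarity());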

One possible danger in these kinds of bake-offs is that people who know
the content will likely craft specific queries that are not reflective
of real users.  It would be good to at least have a standard set of
queries that is tested against each implementation.  Perhaps each
person could contribute a set of test queries in addition to their
Similarity, and the combined query set could be run against each
implementation.

Finally, I'd suggest picking content that has multiple fields and
allowing the individual implementations to decide how to search these
fields -- just title and body would be enough.  I would like to use my
MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
the default MultiFieldQueryParser, assuming somebody uses that in this
test).
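
To make the comparison concrete, the two entries might be built roughly
like this (a sketch only: MaxDisjunctionQuery is from my patch rather
than the Lucene core, and the constructor and 0.1f tie-breaker shown
here are schematic):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class TwoFieldEntries {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    String userQuery = "java performance";
    String[] fields = { "title", "body" };

    // Entry A: stock parser.  It ORs the query across fields, so a doc
    // matching the same term in title and body sums both field scores.
    Query entryA = MultiFieldQueryParser.parse(userQuery, fields, analyzer);

    // Entry B: max-style combination.  Scoring by the best single
    // field (plus a small tie-breaker) keeps one word matched in two
    // fields from outranking two different words matched once each.
    MaxDisjunctionQuery entryB = new MaxDisjunctionQuery(0.1f);
    entryB.add(new QueryParser("title", analyzer).parse(userQuery));
    entryB.add(new QueryParser("body", analyzer).parse(userQuery));
  }
}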

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:[EMAIL PROTECTED]
  > Sent: Friday, December 17, 2004 1:27 PM
  > To: Lucene Developers List
  > Subject: DefaultSimilarity 2.0?
  > 
  > Chuck Williams wrote:
  > > Another issue will likely be the tf() and idf() computations.  I
  > > have a similar desired relevance ranking and was not getting what
  > > I wanted due to the idf() term dominating the score. [ ... ]
  > 
  > Chuck has made a series of criticisms of the DefaultSimilarity
  > implementation.  Unfortunately it is difficult to quickly evaluate
  > these, as it requires relevance judgements.  But, still, we should
  > consider modifying DefaultSimilarity for the 2.0 release if there are
  > easy improvements to be had.  But how do we decide what's better?
  > 
  > Perhaps we should perform a formal or semi-formal evaluation of
  > various Similarity implementations on a reference collection.  For
  > example, for a formal evaluation we might use one of the TREC Web
  > collections, which have associated queries and relevance judgements.
  > Or, less formally, we could use a crawl of the ~5M pages in DMOZ (I
  > would be glad to collect these using Nutch).
  > 
  > This could work as follows:
  >    -- Different folks could download and index a reference
  > collection, offering demonstration search systems.  We would provide
  > complete code.  These would differ only in their Similarity
  > implementation.  All implementations would use the same Analyzer and
  > search only a single field.
  >    -- These folks could then announce their candidate implementations
  > and let others run queries against them, via HTTP.  Different
  > Similarity implementations could thus be publicly and interactively
  > compared.
  >    -- Hopefully a consensus, or at least a healthy majority, would
  > agree on which was the best implementation and we could make that
  > the default for Lucene 2.0.
  > 
  > Are there folks (e.g., Chuck) who would be willing to play this game?
  > Should we make it more formal, using, e.g., TREC?  Does anyone have
  > other ideas how we should decide how to modify DefaultSimilarity?
  > 
  > Doug

