I just got the benchmark running here. It needs a library for the org.apache.commons.compress.tar package. I had to build that from CVS, so in case anyone needs it, I'll gladly send the jar or post it in Bugzilla. Or did I miss the place to download the commons sandbox library?
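For anyone else hitting this, here is a minimal sketch of how the corpus tarball could be unpacked with that sandbox jar. It assumes the sandbox classes mirror Ant's org.apache.tools.tar API (TarInputStream/TarEntry with getNextEntry() and getName()), which is what the build I made from CVS exposes; the tarball and destination paths are just placeholders.

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.tar.TarEntry;
    import org.apache.commons.compress.tar.TarInputStream;

    public class UnpackCorpus {
        public static void main(String[] args) throws IOException {
            // placeholder paths for the "20 newsgroups" tarball and the target directory
            File tarball = new File("20_newsgroups.tar.gz");
            File destDir = new File("corpus");
            TarInputStream tin = new TarInputStream(
                    new GZIPInputStream(new FileInputStream(tarball)));
            TarEntry entry;
            while ((entry = tin.getNextEntry()) != null) {
                File out = new File(destDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                // copy the bytes of the current tar entry to the target file
                OutputStream os = new FileOutputStream(out);
                byte[] buf = new byte[8192];
                int n;
                while ((n = tin.read(buf)) != -1) {
                    os.write(buf, 0, n);
                }
                os.close();
            }
            tin.close();
        }
    }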
On Saturday 18 December 2004 19:07, Chuck Williams wrote:
> I haven't run it yet to take a look at the collections, but the code
> looks fine. subject and body will make good content fields to query
> against. I think we need a couple of additional things, though:
> 1. An interactive UI for trying queries -- should be a webapp so that
> people can use it. The batch query UI should be maintained for running
> a standard test set. There needs to be a way to see the results of the
> batch test (didn't look carefully at how this is done now -- the
> emphasis in this test is on understanding the ordering and scoring of
> results, not on performance, although basic timing should be included
> in case any of the implementations differ on this dimension).

Luke would do fine for that, I suppose. To use it, the benchmark would
need to be split into separate index-building and querying parts.

I'd also like to see some more queries in there, to try and test the
alternative boolean scorer. As this is TREC test data, I would suppose
there are some more queries available. Could someone give me a hint as
to where I could find more queries?

Regards,
Paul Elschot

> 2. Instead of using QueryParser against body, it should use
> MultiFieldQueryParser against subject and body (or maybe against
> subject, from and body). Apps may change this (I will change it to use
> my approach for multiple fields).
>
> Chuck
>
> > -----Original Message-----
> > From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> > Sent: Friday, December 17, 2004 4:06 PM
> > To: Lucene Developers List
> > Subject: Re: DefaultSimilarity 2.0?
> >
> > Chuck Williams wrote:
> > > I think this is a great idea and would be happy to play the game.
> > > Re. the collection, there is some benefit to TREC if somebody is
> > > going to do formal recall and precision computations; otherwise it
> > > doesn't matter much. The best Similarity for any collection is
> > > likely to be specific to the collection, so if the point here is
> > > to pick the best DefaultSimilarity, the collection should be as
> > > representative of Lucene users' content as possible (I know this
> > > is probably impossible to achieve).
> > >
> > > One possible danger in these kinds of bake-offs is that people who
> > > know the content will likely craft specific queries that are not
> > > reflective of real users. It would be good to at least have a
> > > standard set of queries that is tested against each implementation.
> > > Perhaps each person could contribute a set of test queries in
> > > addition to their Similarity, and the combined query set could be
> > > tested against each.
> > >
> > > Finally, I'd suggest picking content that has multiple fields and
> > > allowing the individual implementations to decide how to search
> > > these fields -- just title and body would be enough. I would like
> > > to use my MaxDisjunctionQuery and see how it compares to other
> > > approaches (e.g., the default MultiFieldQueryParser, assuming
> > > somebody uses that in this test).
> >
> > I believe the collection that I'm using in LuceneBenchmark meets
> > most if not all of these requirements -- the "20 newsgroups" corpus.
> > Please see the following link for the benchmark code:
> >
> > http://www.getopt.org/lb/LuceneBenchmark.java
> >
> > This collection has the benefit that it's relatively easy to judge
> > the relative relevance scores, because the nature and structure of
> > the corpus is well understood.
> >
> > --
> > Best regards,
> > Andrzej Bialecki
> > http://www.sigram.com  Contact: info at sigram dot com
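For anyone who wants to try item 2 above before the benchmark is split up, here is a minimal sketch against the Lucene 1.4 API of how a contributed Similarity and a MultiFieldQueryParser over the subject and body fields could be wired together. The "index" path, the FlatTfSimilarity tweak and the example query string are made-up placeholders; each contributed Similarity would of course do its own thing.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SimilarityTrial {

        // Illustrative example of a contributed Similarity: identical to
        // DefaultSimilarity except for a flatter tf curve.
        public static class FlatTfSimilarity extends DefaultSimilarity {
            public float tf(float freq) {
                return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
            }
        }

        public static void main(String[] args) throws Exception {
            // placeholder path to wherever the benchmark built its index
            IndexSearcher searcher = new IndexSearcher("index");
            searcher.setSimilarity(new FlatTfSimilarity());

            // item 2: query subject and body together instead of body alone
            String[] fields = { "subject", "body" };
            Query query = MultiFieldQueryParser.parse(
                    "space shuttle launch", fields, new StandardAnalyzer());

            // print the top ten subjects with their scores
            Hits hits = searcher.search(query);
            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                System.out.println(hits.score(i) + "\t"
                        + hits.doc(i).get("subject"));
            }
            searcher.close();
        }
    }

Running a fixed list of such query strings through each contributed Similarity, as Chuck suggests, would then just be a loop over this.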