[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12447346 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, here is a first crack at a standard benchmark contribution based on Andrzej's 
original contribution and some updates/changes by me.  I wasn't nearly as 
ambitious as some of the comments attached here, but I think most of them are 
good things to strive for and will greatly benefit Lucene.

I checked in the basic contrib directory structure, plus some library 
dependencies, as I wasn't sure how svn diff handles those.  I am posting this 
in patch format to solicit comments first instead of just committing and 
accepting patches.  My plan is to take a round of comments, make updates as 
warranted, and then make an initial commit.  

I am particularly interested in the interface/Driver specification and whether 
people think this approach is useful or not.  My thinking was that it might be 
nice to have a standard way of creating/running benchmarks that could be driven 
by XML configuration files (some examples are in the conf directory).  I am not 
100% sold on this and am open to compelling arguments why we should just have 
each benchmark have its own main() method.
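To make the driver idea concrete, here is a minimal sketch of what such an interface might look like.  The names (Benchmarker, DummyIndexBenchmark, doc.count) are illustrative guesses, not the actual patch, and Properties stands in for the proposed XML configuration to keep the sketch self-contained:

```java
import java.util.Properties;

// Hypothetical sketch of the driver-style interface discussed above;
// names are illustrative, not from the actual patch.
interface Benchmarker {
    // Run a benchmark using settings from a configuration; the patch
    // proposes XML files in the conf directory, Properties stands in
    // here to keep the sketch self-contained.
    void benchmark(Properties options) throws Exception;
}

class DummyIndexBenchmark implements Benchmarker {
    int lastDocCount;  // recorded so callers can inspect what ran

    public void benchmark(Properties options) {
        lastDocCount = Integer.parseInt(options.getProperty("doc.count", "2000"));
        System.out.println("indexing " + lastDocCount + " docs");
    }
}

public class BenchmarkDriver {
    public static void main(String[] args) throws Exception {
        // A real driver would instantiate the Benchmarker named in the
        // XML configuration file instead of hard-coding it.
        Properties options = new Properties();
        options.setProperty("doc.count", "2000");
        Benchmarker b = new DummyIndexBenchmark();
        b.benchmark(options);
    }
}
```

The advantage over per-benchmark main() methods would be that the driver, reporting, and configuration handling are written once and each contributed benchmark only implements the interface.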

As for the actual Benchmarker, I have created a "standard" version, which runs 
off the Reuters collection that is downloaded automatically by the ANT task.  
There are two ANT targets for the two benchmarks: run-micro-standard and 
run-standard.  The micro version takes a few minutes to run on my machine (it 
indexes 2000 docs), the other one takes a lot longer.

There are several support classes in the stats and util packages.  The stats 
package supports building and maintaining information about benchmarks.  The 
util package contains one class for extracting information out of the Reuters 
documents for indexing.
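For anyone unfamiliar with the collection, the Reuters documents are SGML files with tagged fields, so the extraction step amounts to pulling field text out of markup.  The following is a rough sketch of that idea only, not the actual util class from the patch:

```java
// Rough sketch of extracting field text from a Reuters-21578 SGML
// document for indexing; not the actual util class from the patch.
public class ReutersExtractSketch {
    // Return the text between the given SGML open/close tags, or "" if absent.
    static String extract(String sgml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = sgml.indexOf(open);
        if (start < 0) return "";
        start += open.length();
        int end = sgml.indexOf(close, start);
        if (end < 0) return "";
        return sgml.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String doc = "<REUTERS><TITLE>Oil prices rise</TITLE>"
                   + "<BODY>Crude futures climbed today.</BODY></REUTERS>";
        System.out.println(extract(doc, "TITLE"));  // Oil prices rise
        System.out.println(extract(doc, "BODY"));   // Crude futures climbed today.
    }
}
```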

The ReutersQueries class contains a set of queries I created by looking at some 
of the docs in the collection; they cover a myriad of term, phrase, span, 
wildcard and other query types.  They aren't exhaustive by any means.

It should be stressed that these benchmarks are best used for gathering 
before-and-after numbers.  Furthermore, they aren't the be-all and end-all of 
benchmarking for Lucene.  I hope the pluggable interface will encourage others 
to submit benchmarks for specific areas of Lucene not covered by this version.

Thanks to all who contributed their code/thoughts.  Patch to follow.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: BenchmarkingIndexer.pm, extract_reuters.plx, 
> LuceneBenchmark.java, LuceneIndexer.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
