[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436858 ] Michael McCandless commented on LUCENE-675: -------------------------------------------
I think this is an incredibly important initiative: with every non-trivial change to Lucene (eg lock-less commits) we must verify performance did not get worse. But, as things stand now, it's an ad hoc task that each developer must do on their own. So (as a consumer of this) I would love a ready-to-use standard test I could run to check whether I've slowed things down with lock-less commits. In the meantime I've been using Europarl for my testing.

It's also important to realize there are many dimensions to test. With lock-less I'm focusing entirely on "wall clock time to open readers and writers" across different use cases: pure indexing, pure searching, highly interactive mixed indexing/searching, etc. And this is actually hard to test cleanly because in certain cases (the highly interactive case, or the many-readers case) the current Lucene hits many "commit lock" retries and/or timeouts, whereas lock-less doesn't. So what's a "fair" comparison in this case?

In addition to standardizing on the corpus, I think we ideally need a standardized hardware / OS / software configuration as well, so the numbers are easily comparable across time. Even the test process itself is important: details like "you should reboot the box before each run" and "discard the results of the first run, then take the average of the next 3 runs as your result" matter.

It would be wonderful if we could get this into a nightly automated regression test so we could track over time how the performance has changed (and, for example, quickly detect accidental regressions). We should probably open that as a separate issue which depends first on this issue being complete.
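As a rough illustration (not part of any proposed suite; `BenchmarkProtocol` and `measure` are hypothetical names), the measurement protocol above — discard the first warm-up run, then average the next three — could be sketched like this:

```java
import java.util.Arrays;

public class BenchmarkProtocol {

    /**
     * Runs the task (runs + 1) times, discards the first (warm-up) result,
     * and returns the mean wall-clock time in milliseconds of the rest.
     */
    static double measure(Runnable task, int runs) {
        double[] timesMs = new double[runs];
        task.run();                                   // warm-up run, discarded
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            timesMs[i] = (System.nanoTime() - start) / 1e6;
        }
        return Arrays.stream(timesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Stand-in workload for "open an IndexWriter/IndexReader" etc.
        double avgMs = measure(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 3);
        System.out.println("average of 3 timed runs: " + avgMs + " ms");
    }
}
```

A real harness would of course wrap actual Lucene indexing and searching tasks, and a nightly job could record these averages to track regressions over time.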
> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
> We need an objective way to measure the performance of Lucene, both indexing
> and querying, on a known corpus. This issue is intended to collect comments
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is
> the original Reuters collection, available from
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
> or
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
> I propose to use this corpus as a base for benchmarks. The benchmarking
> suite could automatically retrieve it from known locations, and cache it
> locally.
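The "retrieve it from known locations, and cache it locally" idea from the issue description could be sketched roughly as follows (`CorpusCache` and `fetch` are illustrative names I've made up, not part of the attached patch):

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;

public class CorpusCache {

    /**
     * Downloads url into cacheDir unless a copy is already cached;
     * returns the local path either way.
     */
    static Path fetch(String url, Path cacheDir) throws Exception {
        Files.createDirectories(cacheDir);
        String name = url.substring(url.lastIndexOf('/') + 1);
        Path local = cacheDir.resolve(name);
        if (Files.exists(local)) {
            return local;                              // cache hit: skip the download
        }
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        }
        return local;
    }

    public static void main(String[] args) throws Exception {
        // Demonstrate with a local file: URL so the example runs offline.
        Path src = Files.createTempFile("corpus", ".txt");
        Files.writeString(src, "sample corpus data");
        Path cached = fetch(src.toUri().toString(), Files.createTempDirectory("cache"));
        System.out.println("cached at: " + cached);
    }
}
```

A production version would also verify the archive (size or checksum) before trusting a cached copy, since a partial download would silently corrupt benchmark results.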