Dave Kor wrote:
Hi,
On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote:
Dear Lucene developers,
I'd be interested in doing some benchmarking on (at least) Lucene,
Egothor and MG4J. There is no actual data around on publicly available
collections, and it would be nice to have some more objective data on
efficiency for a significantly large collection.
I was wondering if you have seen the TREC 2004 paper by Giuseppe
Attardi, Andrea Esuli and Chirag Pate from the University of Pisa,
Italy, titled "Using Clustering and Blade Clusters in the TeraByte
task"? http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf
In the paper, three search engines (including Lucene) were benchmarked
on the GOV2 corpus.
I briefly looked at this document, but the testing environment is not
described clearly enough. E.g. for Lucene, there is no information about
the JDK version, the heap size, or whether it was run with -server or -client.
Also, the authors mention that "times were obtained after repeating the
query twice, in order to allow for the effects of memory caching", which
instantly makes me suspicious ... HotSpot usually requires several
minutes of warm-up. In short, I think the numbers for Lucene are not to
be trusted.
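To illustrate the warm-up point: a sketch (not the paper's setup; the query method below is a hypothetical stand-in for a real search call) of how a HotSpot micro-benchmark typically discards many initial iterations so the JIT compiler can reach steady state before anything is timed, instead of timing the second repetition of a query.

```java
import java.util.Arrays;

// Hedged sketch: demonstrates warming up a hot path before measuring it.
// runQuery() is a hypothetical placeholder; a real benchmark would invoke
// the search engine's query execution here.
public class WarmupBench {

    // Stand-in workload for "run one query".
    static long runQuery(int seed) {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) {
            acc += (acc ^ (i + seed)) % 7919;
        }
        return acc;
    }

    // Time a single invocation in nanoseconds.
    static long timeOnce(int seed) {
        long t0 = System.nanoTime();
        runQuery(seed);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // Warm-up phase: repeat the workload so HotSpot can profile
        // and JIT-compile it. Two repetitions would not be enough.
        for (int i = 0; i < 10_000; i++) {
            runQuery(i);
        }

        // Measurement phase: sample several runs and report the median,
        // which is less sensitive to GC pauses than a single timing.
        long[] samples = new long[21];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = timeOnce(i);
        }
        Arrays.sort(samples);
        System.out.println("median ns after warm-up: " + samples[samples.length / 2]);
    }
}
```

Timing before such a warm-up mixes interpreted and compiled executions, which is why numbers "after repeating the query twice" are hard to trust.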
The indexing times seem strange, too - a couple of minutes for the other
engines, and > 4 hours for Lucene? Something's wrong here ...
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com