Dave Kor wrote:
Hi,
On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote:
Dear Lucene developers,
I'd be interested in doing some benchmarking on (at least) Lucene,
Egothor and MG4J. There is no actual data around on publicly available
collections, and it would be nice to have some more objective data on
efficiency for a significantly large collection.
I was wondering if you have seen the TREC 2004 paper by Giuseppe
Attardi, Andrea Esuli and Chirag Pate from the University of Pisa,
Italy, titled "Using Clustering and Blade Clusters in the TeraByte
task"? http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf
In the paper, three search engines (including Lucene) were benchmarked
on the GOV2 corpus.
I briefly looked at this document, but the testing environment is not
described clearly enough. E.g. for Lucene, there is no information about
the JDK version, the heap size, or whether it was run with -server or -client.
Also, the authors mention that "times were obtained after repeating the
query twice, in order to allow for the effects of memory caching", which
instantly makes me suspicious ... HotSpot usually requires several
minutes of warm-up. In short, I think the numbers for Lucene are not to
be trusted.
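To illustrate the warm-up point: a sketch (not the paper's setup; the query method below is a hypothetical stand-in for a real search call) of how a HotSpot micro-benchmark typically discards many initial iterations so the JIT compiler can reach steady state before anything is timed, instead of timing the second repetition of a query.

```java
import java.util.Arrays;

// Hedged sketch: demonstrates warming up a hot path before measuring it.
// runQuery() is a hypothetical placeholder; a real benchmark would invoke
// the search engine's query execution here.
public class WarmupBench {

    // Stand-in workload for "run one query".
    static long runQuery(int seed) {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) {
            acc += (acc ^ (i + seed)) % 7919;
        }
        return acc;
    }

    // Time a single invocation in nanoseconds.
    static long timeOnce(int seed) {
        long t0 = System.nanoTime();
        runQuery(seed);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // Warm-up phase: repeat the workload so HotSpot can profile
        // and JIT-compile it. Two repetitions would not be enough.
        for (int i = 0; i < 10_000; i++) {
            runQuery(i);
        }

        // Measurement phase: sample several runs and report the median,
        // which is less sensitive to GC pauses than a single timing.
        long[] samples = new long[21];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = timeOnce(i);
        }
        Arrays.sort(samples);
        System.out.println("median ns after warm-up: " + samples[samples.length / 2]);
    }
}
```

Timing before such a warm-up mixes interpreted and compiled executions, which is why numbers "after repeating the query twice" are hard to trust.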
The indexing times seem strange, too - a couple of minutes for the other
engines, and > 4 hours for Lucene? Something's wrong here ...
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com