Hi,

----- Original Message -----
From: Andrzej Bialecki <[EMAIL PROTECTED]>

Dave Kor wrote:
> Hi,
>
> On 5/29/06, Sebastiano Vigna <[EMAIL PROTECTED]> wrote:
>> Dear Lucene developers,
>> I'd be interested in doing some benchmarking on (at least) Lucene,
>> Egothor and MG4J. There is no actual data around on publicly available
>> collections, and it would be nice to have some more objective data on
>> efficiency for a significantly large collection.
>
> I was wondering if you have seen the TREC 2004 paper by Giuseppe
> Attardi, Andrea Esuli and Chirag Pate from the University of Pisa,
> Italy, titled "Using Clustering and Blade Clusters in the TeraByte
> task"? http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf
>
> In the paper, three search engines (including Lucene) were benchmarked
> on the GOV2 corpus.

I briefly looked at this document, but the testing environment is not 
described clearly enough. E.g. for Lucene, there is no information about 
the JDK version, the heap size, or whether it was run with -server or 
-client. Also, the authors mention that "times were obtained after 
repeating the query twice, in order to allow for the effects of memory 
caching", which instantly makes me suspicious ... HotSpot usually 
requires several minutes of warm-up before the hot code paths are 
compiled. In short, I don't think the numbers for Lucene can be trusted.

OG: There are also command-line options that tell HotSpot how quickly to 
optimize frequently executed paths, for instance.
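OG: As a sketch of what I mean (the jar and class names here are just placeholders, and the specific flag values are illustrative, not taken from the paper), a reproducible Lucene benchmark run would pin down the JVM configuration explicitly:

```shell
# Hypothetical benchmark invocation -- lucene-benchmark.jar and
# QueryBenchmark are placeholders, not real artifacts.
#
# -server               selects the optimizing HotSpot compiler
# -Xms/-Xmx (equal)     fixes the heap size so GC behavior is reproducible
# -XX:CompileThreshold  lowers the invocation count before HotSpot compiles
#                       a hot method, shortening the warm-up phase
java -server -Xms512m -Xmx512m -XX:CompileThreshold=1000 \
     -cp lucene-benchmark.jar org.example.QueryBenchmark queries.txt
```

Reporting exactly this kind of command line alongside the results would make the numbers comparable across engines.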

The indexing times seem strange, too - a couple of minutes for the other 
engines, and > 4 hours for Lucene? Something's wrong here ...


OG: But Andrzej, you already wrote that indexing benchmark tool (which we never 
put anywhere in SVN, I'm afraid) that works on some freely available Reuters 
corpus, I believe.  Why couldn't that be adapted for testing Lucene, Egothor, 
and MG4J?

Otis





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
