Yeah, I haven't played with millions of documents yet. We will need a bigger test collection, I think! Although the benchmarker can add as many docs as you want from the same source, index compression of the repeated content will possibly affect the results more than a bigger collection of all unique docs would.

Maybe it is time to look at adding Wikipedia as a test collection. I think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

Michael McCandless wrote:
Also, one caveat: whenever #docs (21578 for Reuters) divided by maxBufferedDocs is less than mergeFactor, no merges will take place during your runs. This greatly skews the results.

Also, my guess is that this index fits entirely in the buffer cache. Things behave quite differently when segments are larger than available memory and merging requires lots of disk i/o.
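For concreteness, here is a minimal sketch of the no-merge condition Michael describes, against the 2.x IndexWriter API. The class name and the specific settings are mine, purely illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class MergeSkewDemo {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                                             new StandardAnalyzer(), true);
        // 21578 Reuters docs flushed in batches of 10000 produce
        // ceil(21578 / 10000) = 3 on-disk segments.
        writer.setMaxBufferedDocs(10000);
        // A merge only triggers once mergeFactor (10) segments pile up
        // at the same level; 3 < 10, so this run never pays merge cost.
        writer.setMergeFactor(10);
        // ... add the 21578 documents here ...
        writer.close();
      }
    }

Dropping maxBufferedDocs to, say, 1000 would flush roughly 22 segments and force merging to kick in, so merge cost would actually show up in the timings.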

Doug



--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


