I'm sorry if you receive this e-mail twice. My ISP has problems with SMTP relay.

Mike Sawka wrote:

We are currently running some multi-gigabyte indexes with over 10
million documents, and the "optimize" time is starting to become a
problem.  For our largest indexes we're already seeing times of 10-20
minutes, on a fairly decent machine, which is starting to hit the
threshold of acceptability for us (and will become unbearable as the
index grows 2-10 times larger).  So I've got two questions:

* Are there any tricks that you guys use to run large (incrementally
updatable) indexes? I've already set up a mirroring system so I have one
index that is always searchable while the other one is incrementally
updating (and they swap periodically).



The optimize() routine is a bottleneck in Lucene. You have two options:
a) do not call optimize() at all; b) call optimize() only after you have
modified a large part of your index (>75% of items). Others may offer
further advice, but there is a theoretical barrier here that cannot be
removed.
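
Option b) can be sketched roughly as follows. This is only an
illustration: the OptimizePolicy class and its methods are hypothetical
and not part of Lucene. Only the 75% threshold comes from the advice
above.

```java
/**
 * Hypothetical helper (not a Lucene class): tracks how much of an index
 * has been modified since the last optimize() and reports when the
 * modified share exceeds 75% of the documents.
 */
final class OptimizePolicy {
    private final long totalDocs;
    private long modifiedDocs;

    OptimizePolicy(long totalDocs) {
        if (totalDocs <= 0) {
            throw new IllegalArgumentException("totalDocs must be positive");
        }
        this.totalDocs = totalDocs;
    }

    /** Record documents inserted or removed since the last optimize(). */
    void recordModifications(long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        modifiedDocs += count;
    }

    /** True only when strictly more than 75% of the items were modified. */
    boolean shouldOptimize() {
        // Integer form of modifiedDocs / totalDocs > 0.75, avoiding rounding.
        return modifiedDocs * 4 > totalDocs * 3;
    }

    /** Call after the index has actually been optimized. */
    void reset() {
        modifiedDocs = 0;
    }
}
```

The application would call the index's own optimize() only when
shouldOptimize() returns true, and then call reset().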

I was interested in this problem last year, and the method which was
developed for another OSS search engine is presented here:
http://www.egothor.org/temp/00-combi.png. The figure compares my method
with a build-from-scratch approach using merge factor 100. Lucene
(merge factor 100) seems to be slower than my method: by about 40% for
N=2^16 and by about 15-20% for N=2^46. Add these values to the numbers
in the figure to see what Lucene does and when.

Using the figure, you can decide whether you would rather rebuild your
index from scratch or repair it using insert/removeDoc()/optimize(). If
neither approach is fast enough, you should redesign your application.

Hope this helps.

Leo

PS: The figure is based on a simulation of my algorithm. The results for
N<2^26 have already been verified in a real system. "number of
documents" is log_2(total number of documents in the index)
(2^16...2^46); "operations needed" sums the I/O read and write
operations and compares them to the I/O of a rebuild from scratch.

N=total number of docs in the index


