Leo,

For the most part I'm a lurker.  I do click on links provided by members of
the list, and I must say, your link below for egothor is rather impressive.
I don't see an "about us" link on the site explaining how it started, or
whether egothor is based on anything (I assume not; otherwise it would have
been mentioned).

Is the only info available that which is found on the site?

Regards,

John



-----Original Message-----
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 21, 2004 6:58 AM
To: Lucene Developers List
Subject: Re: large index scalability and java 1.1 compatibility question


I'm sorry if you receive this e-mail twice; my ISP is having problems with
its SMTP relay.

Mike Sawka wrote:

>We are currently running some multi-gigabyte indexes with over 10
>million documents, and the "optimize" time is starting to become a
>problem.  For our largest indexes we're already seeing times of 10-20
>minutes, on a fairly decent machine, which is starting to hit the
>threshold of acceptability for us (and will become unbearable as the
>index grows 2-10 times larger).  So I've got two questions:
>
>   * Are there any tricks that you guys use to run large (incrementally
>updatable) indexes?  I've already set up a mirroring system so I have one
>index that is always searchable while the other one is incrementally
>updating (and they swap periodically).
>
>

The optimize() routine is a bottleneck in Lucene. You have two options:
a) do not call optimize() at all; b) modify your index significantly (>75% of
items) and only then call optimize(). Somebody may offer further tricks, but
there is a theoretical barrier that cannot be overcome.

I was interested in this problem last year, and the method developed for
another OSS search engine is presented here:
http://www.egothor.org/temp/00-combi.png. The figure shows a comparison
between my method and a build-from-scratch approach with a merge factor of
100. Lucene (merge factor 100) appears to be slower than my method: by about
40% for N=2^16 and by about 15-20% for N=2^46; add these values to the
presented numbers and you will see what Lucene does and when.

Using the figure, you can decide whether you would rather rebuild your
index from scratch or repair it incrementally with
insert/removeDoc()/optimize(). If both approaches fail, you should redesign
your application.
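Option (b) above amounts to simple bookkeeping around the index: count the
documents inserted or removed, and only pay for the expensive optimize()
once the changes pass the 75% mark. A minimal sketch of that bookkeeping is
below; the DeferredOptimizer class and its method names are hypothetical
illustrations, not Lucene API (in a real system, optimize() would delegate
to Lucene's IndexWriter.optimize()), and only the 75% threshold comes from
the advice above.

```python
# Sketch of option (b): defer optimize() until a large fraction (>75%)
# of the index has been modified. Hypothetical helper, not Lucene API.

class DeferredOptimizer:
    def __init__(self, total_docs, threshold=0.75):
        self.total_docs = total_docs   # documents currently in the index
        self.threshold = threshold     # fraction of changes that triggers optimize
        self.modified = 0              # inserts/removes since the last optimize
        self.optimize_calls = 0

    def record_update(self, n=1):
        """Count n documents inserted or removed; optimize when due."""
        self.modified += n
        if self.modified >= self.threshold * self.total_docs:
            self.optimize()

    def optimize(self):
        # Placeholder for the expensive IndexWriter.optimize() call.
        self.optimize_calls += 1
        self.modified = 0

opt = DeferredOptimizer(total_docs=1_000_000)
for _ in range(2_000_000):   # two million incremental updates
    opt.record_update()
print(opt.optimize_calls)    # optimize ran only twice (at 750k and 1.5M)
```

With the naive approach of optimizing after every batch of updates, the
10-20 minute cost is paid constantly; batched this way it is paid rarely,
at the price of searching a less compact index in between.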

Hope this helps.

Leo

PS: The figure is based on a simulation of my algorithm; the results for
N<2^26 have already been verified in a real system. The "number of
documents" axis is log_2 of the total number of documents in the index
(2^16...2^46), and "operations needed" sums the I/O read and write
operations and compares them to the I/O of a rebuild-from-scratch.

N = total number of documents in the index



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




