Chuck Williams wrote:
> I've got about 30k documents and have 3 indexing scenarios:
>
> 1. Full indexing and optimize
> 2. Incremental indexing and optimize
> 3. Parallel incremental indexing without optimize
>
> Search performance is critical.  For both cases 1 and 2, I'd like the
> fastest possible indexing time.  For case 3, I'd like minimal pauses and
> no noticeable degradation in search performance.
>
> Based on reading the code (including the javadocs comments), I'm
> thinking of values along these lines:
>
> mergeFactor:  1000 during Full indexing, and during optimize (for both
> cases 1 and 2); 10 during incremental indexing (cases 2 and 3)

A mergeFactor of 1000 is too large for any practical purpose.

I don't see a point in using different mergeFactors in cases 1 and 2. If you're going to optimize before you search, then you want the fastest batch indexing mode. I would use something like 50 for both cases 1 and 2.

For case 3, where unoptimized search performance is very important, I would use something smaller than 10. For Technorati's blog search, which incrementally maintains a Lucene index with millions of documents, I used a mergeFactor of 2 in order to maximize search performance. Indexing performance on a single CPU is still adequate to keep up with the rate of change of today's blogosphere.
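To make the trade-off concrete, here is a toy model of level-based segment merging. It is only a sketch and not Lucene's actual code: the class name, the method name, and the assumption that every new document starts as a one-document segment are illustrative, and the minMergeDocs value of 10 is just a convenient number for the demo. It counts how many segments an unoptimized index would hold after a run of adds.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging; NOT Lucene's actual code. */
class SegmentMergeModel {

    /**
     * Returns the sizes of the segments left after adding numDocs documents
     * one at a time, merging whenever a level fills up.
     */
    static List<Integer> simulate(int numDocs, int mergeFactor, int minMergeDocs) {
        List<Integer> segments = new ArrayList<>();
        for (int d = 0; d < numDocs; d++) {
            segments.add(1); // assumption: each new document is a one-document segment
            long target = minMergeDocs;
            while (true) {
                // Sum the trailing run of segments smaller than the current target.
                int first = segments.size();
                long sum = 0;
                while (first > 0 && segments.get(first - 1) < target) {
                    first--;
                    sum += segments.get(first);
                }
                if (sum < target) {
                    break; // not enough small segments to merge at this level
                }
                // Replace the run with a single merged segment.
                segments.subList(first, segments.size()).clear();
                segments.add((int) sum);
                target *= mergeFactor; // the next level is mergeFactor times larger
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        int numDocs = 29_990; // roughly the 30k documents mentioned in the question
        for (int mergeFactor : new int[] {2, 10, 50}) {
            int count = simulate(numDocs, mergeFactor, 10).size();
            System.out.println("mergeFactor=" + mergeFactor + " -> " + count + " segments");
        }
    }
}
```

In this model a mergeFactor of 2 leaves 9 segments, 10 leaves 29, and 50 leaves 59 for 29,990 documents. The absolute numbers depend on the assumptions above, but the trend is the point: a smaller mergeFactor keeps fewer segments around for an unoptimized search to visit, at the cost of more merging while indexing.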

> minMergeDocs: 1000 during Full indexing, 10 during incremental indexing

I see no reason to lower this when indexing incrementally. 1000 is a good value for high-performance indexing when RAM is plentiful and documents are not too large.


> maxMergeDocs:  Integer.MAX_VALUE during full indexing, 1000 during
> incremental indexing

1000 seems low to me, as it will result in too many segments, slowing search. Here one should select the largest value that can be merged within the maximum delay your application permits between a new document arriving and it appearing in search results. So how up-to-date must your index be? If it's okay for it to occasionally be a few minutes out of date, then you can probably safely increase this to at least tens or hundreds of thousands, perhaps even millions. When incrementally indexing, the most recently added segments stay cached in RAM by the filesystem. So, on a system with a gigabyte of RAM that's dedicated to incremental indexing, you might safely set maxMergeDocs to account for a few hundred megabytes of index without encountering slow, I/O-bound merges.
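As a back-of-the-envelope illustration of that sizing rule, the sketch below turns a RAM budget into a candidate maxMergeDocs and compares the rough segment counts. The helper names are hypothetical, and the 2 KB average index size per document is an assumed figure that you would need to measure on your own index.

```java
/** Back-of-the-envelope helpers for picking maxMergeDocs; all names are illustrative. */
class MaxMergeDocsEstimate {

    /** Largest document count whose index data fits in the given RAM budget. */
    static long docsForRamBudget(long ramBudgetBytes, long avgIndexBytesPerDoc) {
        return ramBudgetBytes / avgIndexBytesPerDoc;
    }

    /** Rough segment count if no segment grows past maxMergeDocs documents. */
    static long approxSegments(long numDocs, long maxMergeDocs) {
        return (numDocs + maxMergeDocs - 1) / maxMergeDocs; // ceiling division
    }

    public static void main(String[] args) {
        long budget = 300L * 1024 * 1024; // "a few hundred megabytes" of cached index
        long perDoc = 2L * 1024;          // ASSUMED average index bytes per document
        long cap = docsForRamBudget(budget, perDoc);
        System.out.println("maxMergeDocs candidate: " + cap);
        System.out.println("segments with cap 1000: " + approxSegments(30_000, 1_000));
        System.out.println("segments with that cap: " + approxSegments(30_000, cap));
    }
}
```

Under these assumptions the budget allows a cap of 153,600 documents, so a 30k-document index is never forced to stay split, whereas a cap of 1000 leaves roughly 30 segments for every search to visit.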


> Since mergeFactor is used in both addDocument() and optimize(), I'm
> thinking of using two different values in case 2:  10 during the
> incremental indexing, and then 1000 during the optimize.  Is changing
> the value like this going to cause a problem?

It should not cause problems to use different mergeFactors at different times.


Doug
