[ http://issues.apache.org/jira/browse/LUCENE-528?page=comments#action_12443723 ]

Ning Li commented on LUCENE-528:
--------------------------------

We want a robust algorithm for the version of addIndexes() which does not call optimize().

The robustness can be expressed as the two invariants guaranteed by the merge policy for adding documents (if mergeFactor M does not change and segment doc count is not reaching maxMergeDocs):

  B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))

  1: If i (left*) and i+1 (right*) are two consecutive segments of doc counts x and y, then f(x) >= f(y).
  2: The number of committed segments on the same level (f(n)) <= M.

References are at http://www.gossamer-threads.com/lists/lucene/java-dev/35147, LUCENE-565 and LUCENE-672.

AddIndexes() can be viewed as adding a sequence of segments S to a sequence of segments T. Segments in T follow the invariants but segments in S may not since they could come from multiple indexes.

Here is the merge algorithm for addIndexes():

  1. Flush ram segments.
  2. Consider a combined sequence with segments from T followed by segments from S (same as current addIndexes()).
  3. Assume the highest level for segments in S is h. Call maybeMergeSegments(), but instead of starting w/ lowerBound = -1 and upperBound = maxBufferedDocs, start w/ lowerBound = -1 and upperBound = upperBound of level h. After this, the invariants are guaranteed except for the last < M segments whose levels <= h.
  4. If the invariants hold for the last < M segments whose levels <= h, done. Otherwise, simply merge those segments. If the merge results in a segment of level <= h, done. Otherwise, it's of level h+1 and call maybeMergeSegments() starting w/ upperBound = upperBound of level h+1.

Suggestions?

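To make the level function and the two invariants above concrete, here is a minimal Java sketch. It is only an illustration: the class MergeInvariants and the methods level() and holds() are hypothetical names, not part of IndexWriter or any patch on this issue. It assumes every segment has at least one document and that mergeFactor is at least 2, and it computes f(n) with integer arithmetic to avoid floating-point rounding in the logarithm.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Lucene API: illustrates f(n) and the two invariants.
class MergeInvariants {

    /** f(n) = ceil(log_M(ceil(n / B))), with B = maxBufferedDocs and M = mergeFactor. */
    static int level(long n, int maxBufferedDocs, int mergeFactor) {
        long units = (n + maxBufferedDocs - 1) / maxBufferedDocs; // ceil(n / B)
        int level = 0;
        long capacity = 1; // M^level
        while (capacity < units) { // smallest level with M^level >= ceil(n / B)
            capacity *= mergeFactor;
            level++;
        }
        return level;
    }

    /**
     * Returns true if a left-to-right sequence of segment doc counts satisfies
     * invariant 1 (levels never increase from left to right) and
     * invariant 2 (at most M segments on any one level).
     */
    static boolean holds(long[] docCounts, int maxBufferedDocs, int mergeFactor) {
        Map<Integer, Integer> segmentsPerLevel = new HashMap<>();
        int previousLevel = Integer.MAX_VALUE;
        for (long n : docCounts) {
            int lvl = level(n, maxBufferedDocs, mergeFactor);
            if (lvl > previousLevel) {
                return false; // invariant 1 violated: f(x) < f(y)
            }
            int count = segmentsPerLevel.merge(lvl, 1, Integer::sum);
            if (count > mergeFactor) {
                return false; // invariant 2 violated: more than M segments on this level
            }
            previousLevel = lvl;
        }
        return true;
    }
}
```

With B = 10 and M = 10, doc counts 1000, 100, 100, 10 map to levels 2, 1, 1, 0 and satisfy both invariants, while 10 followed by 100 (levels 0, 1) violates invariant 1. That second case is what can happen when segments from S are appended after T.
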
> Optimization for IndexWriter.addIndexes()
> -----------------------------------------
>
>                 Key: LUCENE-528
>                 URL: http://issues.apache.org/jira/browse/LUCENE-528
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Tamm
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>         Attachments: AddIndexes.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to
> optimize the index both before and after adding the segments. When you have
> a very large index, to which you are adding batches of small updates, these
> calls to optimize make using addIndexes() impossible. It makes parallel
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on
> the newly added documents. It will try to avoid calling mergeSegments until
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works
> correctly if people are interested. I gave it a different name because it
> has very different performance characteristics which can make querying take
> longer.