> I believe this patch probably also changes the merge behavior. I think we
> need to discuss what exactly the new merge behavior is, if it's OK, what
> we think the index invariants should be (no more than x segments of y
> size, etc), and I'd like to see some code to test those invariants.
Yes, the patch does change the merge behaviour. One example was discussed in one of my comments on TestIndexModifier. I think it's a very good idea to think about the index invariants. They can help check the robustness of any new merge policy, and they can also serve as a guideline when we come up with a good merge policy for a version of addIndexes() where optimize() won't be called (LUCENE-528).

Here is an outline of the rest of the email:
1) The current Lucene merge policy.
2) Its strengths and weaknesses. The strengths are good candidates for index invariants.
3) The changed merge behaviour in the patch.

1) Current Lucene merge policy

TargetMergeDocs is a threshold above which a merge will happen. Initially, it is set to maxBufferedDocs. Starting from the newest segment, sum the doc counts of consecutive segments whose doc count is less than targetMergeDocs. If the sum is less than targetMergeDocs, no merge happens. Otherwise, those segments are merged, targetMergeDocs is multiplied by mergeFactor, and the process repeats. (A toy simulation of this policy appears at the end of this mail.)

Another part of a merge policy is how ram segments are flushed during close(). Currently, the doc counts of all the ram segments plus one on-disk segment are summed. If the sum is less than or equal to mergeFactor, those segments are merged and written to disk. Otherwise, only the ram segments are merged and written to disk. (Does it make more sense to use maxBufferedDocs as the threshold?)

2) Strengths and weaknesses of the current merge policy

The following notation will be used:
  B = maxBufferedDocs
  M = mergeFactor
  f(n) = floor(log_M(n / B)). If f(n) = c, it means B*(M^c) <= n < B*(M^(c+1)).

The current merge policy has two nice properties:
1 If i and i+1 (i older, i+1 newer) are two consecutive segments of doc counts x and y, with y >= B, then f(x) >= f(y). In other words, if B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c).
2 For any c >= 0, there are fewer than M segments whose doc count y satisfies B*(M^c) <= y < B*(M^(c+1)).

Both properties can be proved by induction. From property 1 it follows that segments with the same f(y) are consecutive. (A sketch of a test for these invariants appears at the end of this mail as well.)

It also has two weaknesses:
1 It doesn't take deletes into consideration: docCount was used above, not numDocs. So it's possible that a large segment with a lot of deleted documents won't get cleaned up until much later.
2 Property 2 only covers segments whose doc count is >= B. The number of on-disk segments whose doc count is < B could be well above M because of how ram segments are flushed. For example, if B = 1000, M = 10, and we close the index every time 10 documents are inserted, we could end up with 99 on-disk segments, each with doc count 10. Does it make more sense to use B as the threshold during ram flush instead of M? Or is it better to simply cap the number of on-disk segments whose doc count is < B?

3) Changed merge behaviour in the patch

In the patch, when a merge happens, the segments being merged are either all in ram or all on disk, but not both. Because of this, it has a weakness similar to the second weakness of the current merge policy, but worse. The rest is the same: the same two strengths and the same first weakness.

I'm looking forward to more discussion of the index invariants and of how the current merge policy could be improved. So although the patch could be modified to match the exact behaviour of the current merge policy, I'll wait until we reach some agreement.
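Since the quoted mail asks for code to test the invariants, here are two self-contained sketches. Neither is Lucene code: the class and method names are mine, and a "segment" is modelled as nothing but its doc count. The first simulates the merge policy from section 1 and reproduces the 99-segment case from weakness 2:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the merge policy described in section 1.
    // Not Lucene code; a "segment" is just its doc count.
    public class MergePolicySim {
        final int B;  // maxBufferedDocs
        final int M;  // mergeFactor
        final List<Integer> segments = new ArrayList<>(); // oldest first

        MergePolicySim(int b, int m) { B = b; M = m; }

        // The cascading check from section 1: starting from the newest
        // segment, sum consecutive doc counts below targetMergeDocs;
        // if the sum reaches the target, merge and retry at M * target.
        void maybeMerge() {
            long target = B;
            while (true) {
                int first = segments.size();
                long sum = 0;
                while (first > 0 && segments.get(first - 1) < target)
                    sum += segments.get(--first);
                if (sum < target)
                    break;                       // no merge at this level
                segments.subList(first, segments.size()).clear();
                segments.add((int) sum);         // the merged segment
                target *= M;                     // re-check the next level
            }
        }

        public static void main(String[] args) {
            // Weakness 2: B = 1000, M = 10, close after every 10 docs.
            // Each close flushes the ram segments as a single 10-doc
            // on-disk segment (10 docs plus the newest on-disk segment
            // exceeds M, so no on-disk segment joins the flush merge).
            MergePolicySim sim = new MergePolicySim(1000, 10);
            for (int close = 1; close <= 100; close++) {
                sim.segments.add(10);
                sim.maybeMerge();
                if (close == 99 || close == 100)
                    System.out.println("after close " + close + ": "
                            + sim.segments.size() + " on-disk segment(s)");
            }
            // after close 99: 99 on-disk segment(s)
            // after close 100: 1 on-disk segment(s) -- the sum finally
            // reaches B = 1000 and everything merges into one segment.
        }
    }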
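The second is the kind of invariant test I have in mind: it checks the two properties from section 2 against any list of segment doc counts, for example sim.segments from the simulation above. Run it with assertions enabled (java -ea InvariantCheck).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a test for the two invariants in section 2, over a
    // list of segment doc counts (oldest first). Not Lucene code.
    public class InvariantCheck {

        // f(n) = floor(log_M(n / B)), by repeated multiplication to
        // avoid floating point; only meaningful for n >= B.
        static int f(int n, int B, int M) {
            int c = 0;
            long bound = (long) B * M;   // B * M^(c+1)
            while (n >= bound) {
                c++;
                bound *= M;
            }
            return c;
        }

        static void check(List<Integer> segs, int B, int M) {
            Map<Integer, Integer> perLevel = new HashMap<>();
            for (int i = 0; i < segs.size(); i++) {
                int y = segs.get(i);
                if (y < B)
                    continue;            // the properties only cover y >= B
                int c = f(y, B, M);
                // Property 1: the older neighbour x of a segment with
                // f(y) = c must satisfy x >= B * M^c, i.e. f(x) >= c.
                if (i > 0) {
                    int x = segs.get(i - 1);
                    assert x >= B && f(x, B, M) >= c
                            : "property 1 violated at segment " + i;
                }
                // Property 2: fewer than M segments at each level c.
                int count = perLevel.merge(c, 1, Integer::sum);
                assert count < M : "property 2 violated at level " + c;
            }
        }

        public static void main(String[] args) {
            // This segment list satisfies both properties for
            // B = 1000, M = 10; the 900-doc segment is below B
            // and so is not covered by either property.
            check(List.of(100000, 10000, 10000, 1000, 1000, 900), 1000, 10);
            System.out.println("invariants hold");
        }
    }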
Ning