On 9/2/06, Ning Li <[EMAIL PROTECTED]> wrote:
> Here is an outline of the rest of the email:
> 1) Current Lucene merge policy.
> 2) Its strengths and weaknesses. The strengths are good candidates for
> index invariants.
> 3) Changed merge behaviour in the patch.
>
> 1) Current Lucene merge policy
> targetMergeDocs is a threshold at or above which a merge will happen.
> Initially, it's set to maxBufferedDocs. Starting from the newest
> segment, sum the doc counts of consecutive segments whose doc count is
> less than targetMergeDocs. If the sum is less than targetMergeDocs, no
> merge happens. Otherwise, merge those segments, multiply targetMergeDocs
> by mergeFactor, and go through the process again.
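In code, that loop looks roughly like this (a sketch only: the class and
method names are mine, not IndexWriter's, and the segment doc counts are
modeled as a plain list, oldest first):

    import java.util.List;

    class MergePolicySketch {
        // Merge the suffix run [first, size) into a single segment, in place.
        static void mergeRun(List<Long> docCounts, int first) {
            long sum = 0;
            while (docCounts.size() > first) {
                sum += docCounts.remove(docCounts.size() - 1);
            }
            docCounts.add(sum);
        }

        // The doc-count-based merge check described above: B = maxBufferedDocs,
        // M = mergeFactor, docCounts lists segment doc counts oldest to newest.
        static void maybeMerge(List<Long> docCounts, long B, long M) {
            long target = B;                    // targetMergeDocs
            while (true) {
                long sum = 0;
                int first = docCounts.size();
                // walk back from the newest segment while counts stay below target
                while (first > 0 && docCounts.get(first - 1) < target) {
                    sum += docCounts.get(--first);
                }
                if (sum < target) {
                    break;                      // not enough docs: no merge
                }
                mergeRun(docCounts, first);     // merge the run into one segment
                target *= M;                    // and look one level up
            }
        }
    }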
> Another part of a merge policy is how ram segments are flushed during
> close(). Currently, the doc counts of all the ram segments plus one
> on-disk segment are summed. If the sum is less than or equal to
> mergeFactor, those segments are merged and written to disk. Otherwise,
> only the ram segments are merged and written to disk. (Does it make
> more sense to use maxBufferedDocs as the threshold?)
Yes, I think it does make more sense to use maxBufferedDocs.
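For concreteness, here is that close-time decision sketched on the same
toy model as above (not actual Lucene code; switching the threshold from
M to B would just change the comparison noted in the comment):

    // Flush ram segments at close(): fold in the newest on-disk segment when
    // the combined doc count stays within mergeFactor (M); otherwise write
    // the ram segments alone. Using maxBufferedDocs (B) as the threshold
    // would replace M in the comparison below.
    static void flushRamSegments(List<Long> ram, List<Long> disk, long M) {
        if (ram.isEmpty()) {
            return;                       // nothing buffered
        }
        long sum = 0;
        for (long d : ram) {
            sum += d;
        }
        int last = disk.size() - 1;
        if (last >= 0 && sum + disk.get(last) <= M) {
            sum += disk.remove(last);     // merge ram + newest disk segment
        }
        disk.add(sum);                    // write the merged segment to disk
        ram.clear();
    }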
> 2) Strengths and weaknesses of the current merge policy
> The following notation will be used:
> B = maxBufferedDocs
> M = mergeFactor
> f(n) = floor(log_M(n / B)). If f(n) = c, then B*(M^c) <= n < B*(M^(c+1)).
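This level function is easy to compute without floating point (the helper
below is mine, not Lucene code). For example, with B=1000 and M=10, a
25,000-doc segment is at level 1 because 10,000 <= 25,000 < 100,000:

    // f(n) = floor(log_M(n / B)), computed by repeated multiplication.
    // Segments with n < B fall below level 0 and are reported as -1 here.
    static int level(long n, long B, long M) {
        int c = -1;
        long bound = B;          // B * M^(c+1)
        while (n >= bound) {
            c++;
            bound *= M;
        }
        return c;
    }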
> The current merge policy has two nice properties:
> 1 If i and i+1 (i older, i+1 newer) are two consecutive segments with
> doc counts x and y, and y >= B, then f(x) >= f(y). In other words, if
> B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c).
> 2 For any c >= 0, there are fewer than M segments whose doc count y
> satisfies B*(M^c) <= y < B*(M^(c+1)). From property 1, we know segments
> with the same f(y) are consecutive.
> Both can be proved by induction.
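Stated as a checker over the list of doc counts (oldest first), using the
level() helper above; again just an illustration of the two properties,
not existing code:

    // Returns true iff both properties hold for docCounts (oldest first).
    static boolean invariantsHold(List<Long> docCounts, long B, long M) {
        int[] perLevel = new int[64];        // segments seen at each level c >= 0
        int prevLevel = Integer.MAX_VALUE;   // level of the previous (older) segment
        for (long n : docCounts) {
            int c = level(n, B, M);
            if (c >= 0 && c > prevLevel) {
                return false;                // property 1: a newer segment never
            }                                // has a higher level than an older one
            if (c >= 0 && ++perLevel[c] >= M) {
                return false;                // property 2: fewer than M per level
            }
            prevLevel = c;
        }
        return true;
    }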
> It also has two weaknesses:
> 1 It doesn't take deletes into consideration. DocCount was used
> above, not numDocs. So it's possible that a large segment with a lot
> of deleted documents won't get cleaned up until much later.
> 2 Property 2 above only covers segments whose doc count is >= B.
> The number of on-disk segments whose doc count is < B can be well
> above M because of how ram segments are flushed. For example, if
> B=1000, M=10, and we close the index every time 10 documents are
> inserted, we could have 99 on-disk segments, each with doc count 10.
> Does it make more sense to use B as the threshold during ram flush
> instead of M? Or is it better to simply set a max number of on-disk
> segments whose doc count is < B?
What about an invariant that says the number of main index segments
with the same level (f(n)) should be less than M?
I am concerned about corner cases causing tons of segments and slowing
search or causing errors due to file descriptor exhaustion.
When merging, maybe we should count the number of segments at a
particular index level f(n), rather than adding up the number of
documents. In the presence of deletions, I think this should lead to
faster indexing (due to less frequent merges).
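Roughly what I have in mind, on the same sketch model (level() as above;
if docCounts held live doc counts, i.e. numDocs rather than docCount,
deleted documents would stop counting toward a segment's level):

    // Merge whenever M consecutive segments share a level, regardless of
    // their summed doc counts.
    static void maybeMergeByLevel(List<Long> docCounts, long B, long M) {
        boolean merged = true;
        while (merged) {
            merged = false;
            int end = docCounts.size();
            while (end > 0) {
                // find the run of consecutive segments at the same level
                int c = level(docCounts.get(end - 1), B, M);
                int start = end;
                while (start > 0 && level(docCounts.get(start - 1), B, M) == c) {
                    start--;
                }
                if (end - start >= M) {
                    long sum = 0;            // merge the run into one segment
                    for (int i = end - 1; i >= start; i--) {
                        sum += docCounts.remove(i);
                    }
                    docCounts.add(start, sum);
                    merged = true;
                    break;                   // list changed: re-scan from the newest
                }
                end = start;                 // step to the next (older) run
            }
        }
    }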
> 3) Changed merge behaviour in the patch
> In the patch, when a merge happens, the segments being merged are
> either all in ram or all on disk, but not both. Because of this, it
> has a weakness similar to the second weakness of the current merge
> policy, but worse. The rest is the same: the same two strengths and
> the same first weakness.
> I'm looking forward to more discussion on the index invariants and
> how the current merge policy could be improved, etc. So although the
> patch could be modified to match the exact behaviour of the current
> merge policy, I'll wait until we reach some agreement.
What is the behavior of your patch under the following scenario:
M=10, B=1000
open writer, add 3 docs, close writer
open writer, add 1000 docs, close writer
Do you avoid the situation of having consecutive segments with doc
counts 3 and 1000 (where f(n) increases from an older segment to a
newer one... a no-no)?
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server