> I believe this patch probably also changes the merge behavior.
> I think we need to discuss what exactly the new merge behavior is, if it's OK,
> what we think the index invariants should be (no more than x segments of y size,
> etc), and I'd like to see some code to test those invariants.

Yes, the patch does change the merge behaviour. One example was
discussed in one of my comments on TestIndexModifier.

I think it's a very good idea to think of the index invariants. The
index invariants can help check the robustness of any new merge
policy. They can also serve as a guideline when we come up with a good
merge policy for a version of addIndexes() where optimize() won't be
called (LUCENE-528).

Here is an outline of the rest of the email:
1) Current Lucene merge policy.
2) Its strengths and weaknesses. The strengths are good candidates for
index invariants.
3) Changed merge behaviour in the patch.

1) Current Lucene merge policy
A threshold, targetMergeDocs, determines when a merge happens.
Initially it is set to maxBufferedDocs. Starting from the newest
segment, sum the doc counts of consecutive segments whose doc counts
are less than targetMergeDocs. If the sum is less than
targetMergeDocs, there is no merge. Otherwise, merge those segments,
multiply targetMergeDocs by mergeFactor, and go through the process
again.
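The cascade described above can be modeled with a small simulation. This is a sketch of the policy as described, not Lucene's actual code; the function name maybe_merge and the representation of segments as a plain list of doc counts are my own choices for illustration:

```python
def maybe_merge(segments, max_buffered_docs, merge_factor):
    """Model of the current merge policy's cascading check.

    segments: doc counts of the segments, oldest first; modified in place.
    This is a simulation of the described policy, not Lucene's code.
    """
    target_merge_docs = max_buffered_docs
    while True:
        # Starting from the newest segment, collect consecutive segments
        # whose doc count is below the current threshold.
        candidates = []
        for count in reversed(segments):
            if count >= target_merge_docs:
                break
            candidates.append(count)
        if sum(candidates) < target_merge_docs:
            return  # sum below threshold: no merge
        # Merge the candidates into one segment, raise the threshold,
        # and go through the process again.
        merged = sum(candidates)
        del segments[len(segments) - len(candidates):]
        segments.append(merged)
        target_merge_docs *= merge_factor

# With B=10, M=10, buffering one doc per segment: after 100 documents
# the cascade leaves a single 100-doc segment.
segments = []
for _ in range(100):
    segments.append(1)
    maybe_merge(segments, 10, 10)
```

After 100 single-doc additions the list collapses to [100]: every tenth doc triggers a merge into a 10-doc segment, and the tenth such segment triggers the second-level merge into 100.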

Another part of a merge policy is how ram segments are flushed during
close(). Currently, the doc counts of all the ram segments plus one
on-disk segment are summed. If the sum is less than or equal to
mergeFactor, those segments are merged and written to disk. Otherwise,
only ram segments are merged and written to disk. (Does it make more
sense to use maxBufferedDocs as the threshold?)
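The close-time flush rule above can be modeled the same way. Again a sketch of the described behaviour under my own naming (flush_ram_segments is hypothetical), not Lucene's implementation:

```python
def flush_ram_segments(ram_doc_counts, disk_doc_counts, merge_factor):
    """Model of how ram segments are flushed during close().

    ram_doc_counts: doc counts of the buffered ram segments.
    disk_doc_counts: doc counts of the on-disk segments, oldest first.
    Returns the new on-disk doc counts. A model only, not Lucene's code.
    """
    if not ram_doc_counts:
        return disk_doc_counts
    total = sum(ram_doc_counts)
    if disk_doc_counts and total + disk_doc_counts[-1] <= merge_factor:
        # Sum <= mergeFactor: fold the newest on-disk segment into the
        # merge and write one combined segment.
        return disk_doc_counts[:-1] + [total + disk_doc_counts[-1]]
    # Otherwise only the ram segments are merged and written to disk.
    return disk_doc_counts + [total]

# Five buffered docs plus a 4-doc disk segment fit under mergeFactor=10,
# so they are merged into one 9-doc segment on disk.
result = flush_ram_segments([1, 1, 1, 1, 1], [4], 10)
```

Note that the threshold here is mergeFactor, not maxBufferedDocs, which is what the parenthetical question above is about.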

2) Strengths and weakness of the current merge policy
The following notations will be used:
B = maxBufferedDocs
M = mergeFactor
f(n) = floor(log_M (n / B)). If f(n) = c, it means B*(M^c) <= n < B*(M^(c+1)).

The current merge policy has two nice properties:
 1) If i and i+1 (i older, i+1 newer) are two consecutive segments
with doc counts x and y where y >= B, then f(x) >= f(y). In other
words, if B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c).
 2) For any c >= 0, there are fewer than M segments whose doc count y
satisfies B*(M^c) <= y < B*(M^(c+1)). From property 1 we know segments
with the same f(y) are consecutive.
Both properties can be proved by induction.
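The bracketing claim built into the definition of f(n) can be checked directly. This is only a numeric sanity check on a few illustrative values, not a proof of the properties; the integer formulation avoids floating-point edge cases at exact powers of M:

```python
def f(n, B, M):
    """f(n) = floor(log_M(n / B)) for n >= B, computed with integers."""
    c = 0
    while B * M ** (c + 1) <= n:
        c += 1
    return c

B, M = 1000, 10
for n in (1000, 5000, 9999, 10000, 99999, 100000):
    c = f(n, B, M)
    # f(n) = c exactly when B*(M^c) <= n < B*(M^(c+1))
    assert B * M ** c <= n < B * M ** (c + 1)
```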

It also has two weaknesses:
 1) It doesn't take deletes into consideration: docCount was used
above, not numDocs. So a large segment with a lot of deleted documents
may not get cleaned up until much later.
 2) Property 2 above only covers segments whose doc count is >= B.
The number of on-disk segments whose doc count is < B can be well
above M because of how ram segments are flushed. For example, if
B=1000, M=10, and we close the index every time 10 documents are
inserted, we can end up with 99 on-disk segments, each with doc count
10. Does it make more sense to use B as the threshold during ram flush
instead of M? Or is it better to simply cap the number of on-disk
segments whose doc count is < B?
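The 99-segment scenario can be reproduced by combining the two rules above in a small simulation. This models the described behaviour under the stated assumptions; it is not Lucene code:

```python
# B=1000, M=10; the index is closed after every 10 inserted documents.
B, M = 1000, 10
disk = []  # on-disk segment doc counts, oldest first

for _ in range(99):
    ram_total = 10  # ten buffered docs at each close()
    # Flush rule: fold in the newest on-disk segment only if the
    # combined doc count is <= mergeFactor. Here 10 + 10 = 20 > 10,
    # so only the ram segments are written out each time.
    if disk and ram_total + disk[-1] <= M:
        disk[-1] += ram_total
    else:
        disk.append(ram_total)
    # Merge rule: segments below targetMergeDocs = B are summed and
    # merged only once the sum reaches B.
    small = [d for d in disk if d < B]
    if sum(small) >= B:
        disk = [d for d in disk if d >= B] + [sum(small)]

# After 99 closes: 99 tiny segments totalling 990 docs, still under
# B = 1000, so no merge has ever fired.
```

Only the 100th close, when the small segments total 1000, would finally trigger a merge.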

3) Changed merge behaviour in the patch
In the patch, when a merge happens, the segments being merged are
either all in ram or all on disk, but never a mix. Because of this, it
shares the second weakness of the current merge policy, only in a
worse form. Everything else is the same: the same two strengths and
the same first weakness.

I'm looking forward to more discussion of the index invariants and
how the current merge policy could be improved. So although the
patch could be modified to match the exact behaviour of the current
merge policy, I'll wait until we reach some agreement.

Ning
