[ 
https://issues.apache.org/jira/browse/LUCENE-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-854:
--------------------------------------

    Attachment: LUCENE-854.patch

I created a new merge policy, to take advantage of non-contiguous merging 
(LUCENE-1076) and fix certain limitations of LogMergePolicy.

The new policy does not support contiguous merging, and always merges according 
to byte size, always pro-rated by the percentage of deleted docs.
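
A minimal sketch of what pro-rating by deletes can look like (this is not code from the patch; the class and method names are hypothetical, and a simple linear discount is assumed):

```java
public final class ProRatedSize {

    /**
     * Bytes attributed to a segment once its deleted docs are discounted.
     * Assumes the discount is linear in the fraction of deleted docs.
     */
    public static long proRatedBytes(long sizeInBytes, int maxDoc, int delCount) {
        if (maxDoc <= 0) {
            return sizeInBytes; // empty segment: nothing to discount
        }
        double delRatio = (double) delCount / maxDoc;
        return (long) (sizeInBytes * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A 1,000,000 byte segment with 25% of its docs deleted.
        System.out.println(proRatedBytes(1_000_000L, 1000, 250));
    }
}
```

With this kind of discount, a segment carrying many deletes looks smaller to the policy and so becomes a cheaper merge candidate.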

The policy's core logic is similar to LogMP, in that it tries to merge roughly 
equal sized segments at once, maxMergeAtOnce (renamed from mergeFactor) at a 
time, resulting in the usual exponential staircase pattern when you feed it 
roughly equal sized segments.

You configure the approx max merged segment size (unlike LogMP where you 
configure the max to-be-merged size, which was always a source of confusion!).  
Unlike LogMP, when segments are getting close to being too large, the new 
policy will merge fewer segs, eg down to merging pairwise, to reach approx the 
max allowed size.  This is important since it makes that setting more 
"accurate"; I now default it to 5 GB (vs LogMP's 2 GB).

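To illustrate how a merge can shrink, eg down to a pair, as segments approach the max merged size, here is a rough sketch.  It is not the patch's selection logic; the greedy walk and all names are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class SizeCappedMerge {

    /**
     * Walks segment sizes (sorted largest first) starting at 'start' and picks
     * up to maxMergeAtOnce of them, skipping any segment that would push the
     * merged size past maxMergedBytes.
     */
    public static List<Long> pickMerge(List<Long> sizesDescending, int start,
                                       int maxMergeAtOnce, long maxMergedBytes) {
        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (int i = start; i < sizesDescending.size() && picked.size() < maxMergeAtOnce; i++) {
            long size = sizesDescending.get(i);
            if (total + size > maxMergedBytes) {
                continue; // would overshoot the cap; try a smaller segment
            }
            picked.add(size);
            total += size;
        }
        return picked;
    }

    public static void main(String[] args) {
        // Sizes in MB, with a 5000 MB cap: only two segments fit.
        List<Long> sizes = Arrays.asList(3000L, 2500L, 2000L, 100L, 50L);
        System.out.println(pickMerge(sizes, 0, 10, 5000L));
    }
}
```

When the segments are all small relative to the cap, the same walk simply picks maxMergeAtOnce of them.
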
There is a separate maxMergeAtOnceExplicit that controls "explicit" merging 
(ie, app calls optimize or expungeDeletes, and maybe in the future also 
addIndexes); I defaulted it to 30.  There is no max segment size for optimize.

The big difference vs LogMP is that the new policy does not "over-merge", 
meaning it does not "pay it forward"/forcefully cascade the way LogMP does 
today.  This fixes the "inadvertent optimize" that LogMP does.

For any given sized index, the new policy computes a budget of how many 
segments that index is allowed to have (ie, it enumerates the steps in the 
staircase, based on maxMergeAtOnce, [floored] min segment size, and total bytes 
in the index); then, if the index is over-budget it picks the least-cost merge.  
This results in a smoother progression over time of number of segments.

There is a new configuration, segmentsPerTier, that lets you control how many 
segments per level you can "tolerate".  This is a nice knob to turn to trade off 
merge cost vs search cost.  It defaults to 10, which means it matches the 
staircase pattern that LogMP produces, but you can now separately control the 
"width" of the stairs in the staircase, from how many segments are merged at 
once for non-explicit merges.
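
The segment budget can be sketched roughly as below.  This is an illustration under stated assumptions, not the patch itself: the names are hypothetical, each stair is assumed to hold up to segmentsPerTier segments, and each next stair is assumed to be maxMergeAtOnce times larger than the one before:

```java
public final class SegmentBudget {

    /**
     * Enumerates the stairs of the staircase, starting at the floored min
     * segment size, and returns how many segments an index of totalBytes is
     * allowed to have before it counts as over-budget.
     */
    public static int allowedSegmentCount(long totalBytes, long floorSegmentBytes,
                                          int maxMergeAtOnce, int segmentsPerTier) {
        if (floorSegmentBytes <= 0 || maxMergeAtOnce < 2 || segmentsPerTier < 1) {
            throw new IllegalArgumentException("invalid merge policy settings");
        }
        long levelSize = floorSegmentBytes;
        long bytesLeft = totalBytes;
        int allowed = 0;
        while (true) {
            double segCountLevel = bytesLeft / (double) levelSize;
            if (segCountLevel < segmentsPerTier) {
                // Last (partial) stair: count whatever still fits here.
                allowed += (int) Math.ceil(segCountLevel);
                break;
            }
            // Full stair: segmentsPerTier segments of this size, then step up.
            allowed += segmentsPerTier;
            bytesLeft -= (long) segmentsPerTier * levelSize;
            levelSize *= maxMergeAtOnce;
        }
        return allowed;
    }

    public static void main(String[] args) {
        // 1000 MB index, 2 MB floor, maxMergeAtOnce = 10, segmentsPerTier = 10.
        System.out.println(allowedSegmentCount(1000L, 2L, 10, 10));
    }
}
```

Raising segmentsPerTier widens each stair, so the budget grows and fewer merges are needed, at the cost of more segments to search.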

It has useCompoundFile and noCFSRatio just like LogMP.

It has a new setting "expungeDeletesPctAllowed", default 10%, which allows 
expungeDeletes to skip merging a segment if it has < 10% deletions.
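
As a sketch of that check (hypothetical names, not the patch's code; treating a segment at exactly the threshold as eligible is an assumption read off the "< 10%" wording above):

```java
public final class ExpungeFilter {

    /**
     * Returns true if expungeDeletes should merge this segment, ie its
     * percentage of deleted docs is not below pctAllowed.
     */
    public static boolean eligibleForExpunge(int maxDoc, int delCount, double pctAllowed) {
        if (maxDoc <= 0) {
            return false; // empty segment: nothing to expunge
        }
        double pctDeletes = 100.0 * delCount / maxDoc;
        return pctDeletes >= pctAllowed;
    }

    public static void main(String[] args) {
        System.out.println(eligibleForExpunge(100, 5, 10.0));  // 5% deletes
        System.out.println(eligibleForExpunge(100, 25, 10.0)); // 25% deletes
    }
}
```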

I think we should keep LogMergePolicy available for apps that want contiguous 
merging, merge by doc count, to not pro-rate by deletions, or to enforce a max 
segment size during optimize.  But, with this, I'd remove the non-contiguous 
support for LogMergePolicy that was added under LUCENE-1076, and make this new 
MP the default one.


> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
>
>                 Key: LUCENE-854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-854
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-854.patch
>
>
> The current merge policy, at every maxBufferedDocs *
> power-of-mergeFactor docs added, will do a fully cascaded merge, which
> is the same as an optimize.
> I think this is not good because at that "optimization point", the
> particular addDocument call is [surprisingly] very expensive.  While,
> amortized over all addDocument calls, the cost is low, the cost is
> paid "up front" and in a very "bunched up" manner.
> I think of this as "pay it forward": you are paying the full cost of
> an optimize right now on the expectation / hope that you will be
> adding a great many more docs.  But, if you don't add that many more
> docs, then, the amortized cost for your index is in fact far higher
> than it should have been.  Better to "pay as you go" instead.
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor.  With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment.  Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.
> With this new merge policy, an index that's a bit bigger than a
> current "optimization point" would then have a lower amortized cost
> per document.  Plus the merge cost is less "bunched up" and less "pay
> it forward": instead you pay for what you are actually using.
> We can start by creating this merge policy (probably, combined with
> the "by size not by doc count" segment level computation from
> LUCENE-845) and then later decide whether we should make it the
> default merge policy.
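
The "merge the first mergeFactor segments once we hit 2X the merge factor" idea from the description above can be simulated with a small sketch.  The names are hypothetical, and applying the same rule to levels above 0 is an assumption; the description only spells it out for level 0:

```java
import java.util.Arrays;

public final class PayAsYouGoSim {

    /**
     * Simulates 'flushes' newly flushed level 0 segments.  Whenever a level
     * holds 2 * mergeFactor segments, the oldest mergeFactor of them are
     * merged into one segment on the next level.  Returns segment counts
     * per level.
     */
    public static int[] simulate(int flushes, int mergeFactor, int maxLevels) {
        int[] counts = new int[maxLevels];
        for (int f = 0; f < flushes; f++) {
            counts[0]++;
            for (int level = 0; level + 1 < maxLevels; level++) {
                if (counts[level] >= 2 * mergeFactor) {
                    counts[level] -= mergeFactor; // merge only the first mergeFactor
                    counts[level + 1]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // mergeFactor = 10: the 20th flush triggers the first level 1 merge.
        System.out.println(Arrays.toString(simulate(20, 10, 3)));
    }
}
```

Each flush triggers at most one merge per level here, so the work is spread out rather than cascading all the way up at once.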

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
