[
https://issues.apache.org/jira/browse/LUCENE-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-854:
--------------------------------------
Attachment: LUCENE-854.patch
I created a new merge policy, to take advantage of non-contiguous merging
(LUCENE-1076) and fix certain limitations of LogMergePolicy.
The new policy does not support contiguous merging, and always merges according
to byte size, always pro rated by pct deletes.
The policy's core logic is similar to LogMP, in that it tries to merge roughly
equal sized segments at once, maxMergeAtOnce (renamed from mergeFactor) at a
time, resulting in the usual exponential staircase pattern when you feed it
roughly equal sized segments.
You configure the approx max merged segment size (unlike LogMP where you
configure the max to-be-merged size, which was always a source of confusion!).
Unlike LogMP, when segments are getting close to being too large, the new
policy will merge fewer segs, eg down to merging pairwise, to reach approx the
max allowed size. This is important since it makes that setting more
"accurate"; I now default it to 5 GB (vs LogMP's 2 GB).
There is a separate maxMergeAtOnceExplicit that controls "explicit" merging
(ie, app calls optimize or expungeDeletes, and maybe in the future also
addIndexes); I defaulted it to 30. There is no max segment size for optimize.
The big difference vs LogMP is that the new policy does not "over-merge",
meaning it does not "pay it forward"/forcefully cascade the way LogMP does
today. This fixes the "inadvertent optimize" that LogMP does.
For any given sized index, the new policy computes a budget of how many
segments that index is allowed to have (ie, it enumerates the steps in the
stair case, based on mergeAtOnce, [floored] min segment size, and total bytes
in the index); then, if the index is over-budget it picks the least-cost merge.
This results in a smoother progression over time of number of segments.
There is a new configuration, segmentsPerTier, that lets you control how many
segments per level you can "tolerate". This is a nice knob to turn to tradeoff
merge cost vs search cost. It defaults to 10, which means it matches the
staircase pattern that LogMP produces, but you can now separately control the
"width" of the stairs in the staircase, from how many segments are merged at
once for non-explicit merges.
It has useCompoundFile and noCFSRatio just like LogMP.
It has a new setting "expungeDeletesPctAllowed", default 10%, which allows
expungeDeletes to skip merging a segment if it has < 10% deletions.
I think we should keep LogMergePolicy available for apps that want contiguous
merging, merge by doc count, to not pro-rate by deletions, or to enforce a max
segment size during optimize. But, with this, I'd remove the non-contiguous
support for LogMergePolicy that was added under LUCENE-1076, and make this new
MP the default one.
> Create merge policy that doesn't periodically inadvertently optimize
> --------------------------------------------------------------------
>
> Key: LUCENE-854
> URL: https://issues.apache.org/jira/browse/LUCENE-854
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-854.patch
>
>
> The current merge policy, at every maxBufferedDocs *
> power-of-mergeFactor docs added, will do a fully cascaded merge, which
> is the same as an optimize.
> I think this is not good because at that "optimization poin", the
> particular addDocument call is [surprisingly] very expensive. While,
> amortized over all addDocument calls, the cost is low, the cost is
> paid "up front" and in a very "bunched up" manner.
> I think of this as "pay it forward": you are paying the full cost of
> an optimize right now on the expectation / hope that you will be
> adding a great many more docs. But, if you don't add that many more
> docs, then, the amortized cost for your index is in fact far higher
> than it should have been. Better to "pay as you go" instead.
> So we could make a small change to the policy by only merging the
> first mergeFactor segments once we hit 2X the merge factor. With
> mergeFactor=10, when we have created the 20th level 0 (just flushed)
> segment, we merge the first 10 into a level 1 segment. Then on
> creating another 10 level 0 segments, we merge the second set of 10
> level 0 segments into a level 1 segment, etc.
> With this new merge policy, an index that's a bit bigger than a
> current "optimization point" would then have a lower amortized cost
> per document. Plus the merge cost is less "bunched up" and less "pay
> it forward": instead you pay for what you are actually using.
> We can start by creating this merge policy (probably, combined with
> with the "by size not by doc count" segment level computation from
> LUCENE-845) and then later decide whether we should make it the
> default merge policy.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]