The merge selection (LogMergePolicy) tries to merge roughly
equal-sized segments (measured in bytes) together, so it creates a
roughly log-staircase pattern: with mergeFactor=10, for example, you
tend to end up with one level of ~1 MB segments, another of ~10 MB,
another of ~100 MB, etc.
I agree, in an NRT app, a larger mergeFactor is likely best since it
minimizes reopen time overall. It's also important to
setMergedSegmentWarmer so a newly merged segment is warmed (in the
background, with CMS) before being returned in a reopened NRT reader.
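Roughly, the warmer can be as simple as running a canned query against
the newly merged segment. Untested sketch -- the field and term in the
warm-up query are just placeholders:

  import java.io.IOException;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;

  public class WarmerExample {
    /** Installs a warmer that touches each newly merged segment. */
    public static void install(IndexWriter writer) {
      writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
        public void warm(IndexReader reader) throws IOException {
          // "reader" sees only the newly merged segment; a throwaway
          // search loads norms/terms so the first NRT query after the
          // merge commits doesn't pay that cost.
          new IndexSearcher(reader).search(
              new TermQuery(new Term("body", "the")), 10);
        }
      });
    }
  }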
And making a custom MergeScheduler that defers big merges until "after
hours" should work well too...
On the impact of large vs. small mergeFactors on search performance, I
think the jury is still out. People should keep testing that (and
report back!). Certainly, for the fastest reopen time you never want
any merging to be done :)
I think there are a number of good merge improvements in flight right
now:
* LUCENE-1750: limiting the max size of the merged segment
* LUCENE-1076: allow merge policy to select non-contiguous segments
 * LUCENE-1737: always bulk-copy when merging -- the bulk-copy
   optimization makes merging the doc stores much faster now, but
   it's brittle because it's sensitive to exactly which fields, and
   in what order, you add to your docs (rough sketch after this list)
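On that last point: until LUCENE-1737 lands, the simplest defense is
to funnel every document through one builder, so the fields always
arrive with the same names, in the same order. Sketch -- the field
names are placeholders:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class DocBuilder {
    /** Always emits fields in the same order: id, title, body. */
    public static Document build(String id, String title, String body) {
      Document doc = new Document();
      doc.add(new Field("id", id, Field.Store.YES,
                        Field.Index.NOT_ANALYZED));
      doc.add(new Field("title", title, Field.Store.YES,
                        Field.Index.ANALYZED));
      doc.add(new Field("body", body, Field.Store.NO,
                        Field.Index.ANALYZED));
      return doc;
    }
  }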
Other things we've talked about, but with no issues opened yet:
 * Down-prioritize all IO associated w/ merging. Java/the OS doesn't
   give us good support for this, so I think we'd have to somehow
   emulate it in Lucene, at the Directory level (rough sketch after
   this list).
 * Don't let the IO from merging wipe the OS's IO cache. For this we
   need access to madvise/posix_fadvise, which we don't have from
   javaland, so I think we'd need an OS-dependent, optional JNI
   extension to do this (hypothetical sketch after this list).
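For the first item, the Directory-level emulation might be an
IndexOutput wrapper that sleeps whenever writes run ahead of a
bytes-per-second budget. Untested sketch -- the budget is arbitrary,
and deciding which outputs belong to merges (so only those get
wrapped) is the unsolved part:

  import java.io.IOException;

  import org.apache.lucene.store.IndexOutput;

  public class ThrottledIndexOutput extends IndexOutput {
    private final IndexOutput out;
    private final double bytesPerSec;
    private final long startNS = System.nanoTime();
    private long bytesWritten;

    public ThrottledIndexOutput(IndexOutput out, double bytesPerSec) {
      this.out = out;
      this.bytesPerSec = bytesPerSec;
    }

    // Sleep until the average write rate falls back under budget.
    private void pace(int newBytes) {
      bytesWritten += newBytes;
      long targetNS = (long) (bytesWritten / bytesPerSec * 1e9);
      long sleepMS = (targetNS - (System.nanoTime() - startNS)) / 1000000;
      if (sleepMS > 0) {
        try {
          Thread.sleep(sleepMS);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
      }
    }

    public void writeByte(byte b) throws IOException {
      out.writeByte(b);
      pace(1);
    }

    public void writeBytes(byte[] b, int offset, int length)
        throws IOException {
      out.writeBytes(b, offset, length);
      pace(length);
    }

    public void flush() throws IOException { out.flush(); }
    public void close() throws IOException { out.close(); }
    public long getFilePointer() { return out.getFilePointer(); }
    public void seek(long pos) throws IOException { out.seek(pos); }
    public long length() throws IOException { return out.length(); }
  }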
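For the second item, the Java half of that JNI extension might look
like this -- to be clear, nothing like it exists today, and every name
here is invented:

  import java.io.FileDescriptor;
  import java.io.IOException;

  public class NativePosixUtil {
    static {
      System.loadLibrary("lucenenative");  // hypothetical native lib
    }

    /**
     * Hint POSIX_FADV_DONTNEED on the given range, so pages written
     * during a merge get dropped from the OS cache instead of
     * evicting hot search data. The native side would be per-OS.
     */
    public static native void fadviseDontNeed(
        FileDescriptor fd, long offset, long len) throws IOException;
  }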
Mike
On Thu, Jul 30, 2009 at 10:56 AM, Shai Erera<[email protected]> wrote:
> I think that when LUCENE-1750 is finished, you will be able to:
>
> 1) Create a MergePolicy that limits the size of the segments it's willing
> to merge.
> 2) Then have a daemon or something that runs at "idle" times and calls
> optimize(maxNumSegments), or even opens a new writer w/ the default merge
> policy and lets it merge?
>
> Shai
>
> On Thu, Jul 30, 2009 at 5:48 PM, Grant Ingersoll <[email protected]>
> wrote:
>>
>> Note also response from Mike that talks a little bit about something along
>> these lines:
>> http://www.lucidimagination.com/search/document/fa990adba4d2572b/is_there_a_way_to_control_when_merges_happen#f6f0bfeef4bf9a39
>>
>> -Grant
>>
>> On Jul 30, 2009, at 10:35 AM, Grant Ingersoll wrote:
>>
>>> Given a large segment and a bunch of small segments, how does the
>>> ConcurrentMergeScheduler (CMS) work? Does it always merge the smaller
>>> segments into the bigger one, or does it merge the smaller segments
>>> together?
>>>
>>> Something I've been thinking about: given a high-update environment (with
>>> near-real-time, i.e. less than 1 minute, search constraints) and/or a very
>>> bursty environment, we've always said to keep the merge factor small for
>>> search reasons, at least in the high-update case. However, I've seen a
>>> couple of times where this causes problems because merges can take over
>>> and cause pauses, even with CMS. So I am wondering if it makes sense to
>>> have a larger merge factor (>10), knowing that I may have a few large
>>> segments and then a bunch of small ones, and that CMS will, in the
>>> background, be able to keep merging the smaller segments together and in
>>> most cases avoid ever having to merge into the large segments (because
>>> maybe I can just optimize down at slower times, or even merge the larger
>>> segments later). Seems like this would let one ensure that the larger
>>> merges need not take place, or at least reduce the chances of that
>>> happening.
>>>
>>> Not sure if I worded that correctly.
>>>
>>> Thanks,
>>> Grant
>>>