The merge selection (LogMergePolicy) tries to merge "roughly" equal
sized (measured in bytes) segments together, so it creates a "roughly"
log-staircase pattern.
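
As a rough illustration of that staircase (a toy model, not the actual
LogMergePolicy code), the sketch below gives each segment a level of
floor(log_mergeFactor(size)) and merges the newest mergeFactor segments
whenever they all share a level.  The class name, the helper names and
the equal-sized flushes are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a log-staircase; names here are hypothetical, not Lucene API.
class LogStaircaseSketch {

    // Level of a segment: floor(log base mergeFactor of its size).
    static int level(long size, int mergeFactor) {
        int lvl = 0;
        while (size >= mergeFactor) {
            size /= mergeFactor;
            lvl++;
        }
        return lvl;
    }

    // True if the newest mergeFactor segments all sit on the same level.
    static boolean tailIsMergeable(List<Long> segs, int mergeFactor) {
        int n = segs.size();
        if (n < mergeFactor) {
            return false;
        }
        int lvl = level(segs.get(n - 1), mergeFactor);
        for (int j = n - mergeFactor; j < n; j++) {
            if (level(segs.get(j), mergeFactor) != lvl) {
                return false;
            }
        }
        return true;
    }

    // Flush equal-sized segments one by one, merging (and cascading) as we go.
    static List<Long> simulate(int flushes, long flushSize, int mergeFactor) {
        List<Long> segs = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segs.add(flushSize);
            while (tailIsMergeable(segs, mergeFactor)) {
                long merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segs.remove(segs.size() - 1);
                }
                segs.add(merged);
            }
        }
        return segs;
    }

    public static void main(String[] args) {
        System.out.println(simulate(13, 1L, 3));
    }
}
```

With 13 equal flushes and mergeFactor=3 the model ends with segments of
9, 3 and 1 units, one per step of the staircase.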

I agree, in an NRT app, a larger mergeFactor is likely best since it
minimizes reopen time overall.  It's also important to call
setMergedSegmentWarmer so a newly merged segment is warmed (in the
background, with CMS) before being returned in a reopened NRT reader.
And making a custom MergeScheduler that defers big merges until "after
hours" should work well too...
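
A minimal sketch of the "after hours" decision, assuming a size
threshold and a quiet window given as hours of the day.  This is only
the gating logic a custom scheduler might consult, not a MergeScheduler
implementation; the class, its methods and the threshold are all
hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical gate: small merges run at once, big ones wait for the quiet window.
class AfterHoursMergeGate {
    private final long bigMergeBytes;   // merges at or above this size are "big"
    private final int quietStartHour;   // inclusive, 0..23
    private final int quietEndHour;     // exclusive, 0..23
    private final Deque<Long> deferred = new ArrayDeque<>();

    AfterHoursMergeGate(long bigMergeBytes, int quietStartHour, int quietEndHour) {
        this.bigMergeBytes = bigMergeBytes;
        this.quietStartHour = quietStartHour;
        this.quietEndHour = quietEndHour;
    }

    // Handles windows that wrap past midnight, e.g. 22 -> 6.
    boolean isQuiet(int hourOfDay) {
        if (quietStartHour <= quietEndHour) {
            return hourOfDay >= quietStartHour && hourOfDay < quietEndHour;
        }
        return hourOfDay >= quietStartHour || hourOfDay < quietEndHour;
    }

    // Returns true if the merge may run now; otherwise remembers it for later.
    boolean submit(long mergeBytes, int hourOfDay) {
        if (mergeBytes < bigMergeBytes || isQuiet(hourOfDay)) {
            return true;
        }
        deferred.add(mergeBytes);
        return false;
    }

    // Hands back the deferred merges once the quiet window has started.
    List<Long> drainIfQuiet(int hourOfDay) {
        List<Long> ready = new ArrayList<>();
        if (isQuiet(hourOfDay)) {
            ready.addAll(deferred);
            deferred.clear();
        }
        return ready;
    }
}
```

A real scheduler would still need to decide how it learns the size of a
pending merge and who wakes it up when the quiet window opens.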

On the impact on search performance of large vs small mergeFactors, I
think the jury is still out.  People should keep testing that (and
report back!).  Certainly, for the fastest reopen time you never want
any merging to be done :)

I think there are a number of good merge improvements in flight right
now:

  * LUCENE-1750: limiting the max size of the merged segment

  * LUCENE-1076: allow merge policy to select non-contiguous segments

  * LUCENE-1737: always bulk-copy when merging -- the bulk copy
    optimization makes merging the doc stores much faster now, but
    it's a brittle optimization since it's sensitive to exactly which
    fields, and in what order, you add to your docs
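
To show why field order matters for the bulk copy, here is a simplified
model (an assumption about the mechanism, not the merger's code): each
segment numbers its fields in the order they were first added, and a
segment's doc stores can only be copied verbatim if its numbering
agrees with the numbering of the merged segment.  All names below are
hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-segment field numbering; not Lucene API.
class FieldNumberingSketch {

    // Number fields in the order they are first seen across the docs.
    static Map<String, Integer> numberFields(List<List<String>> docs) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String field : doc) {
                if (!numbers.containsKey(field)) {
                    numbers.put(field, numbers.size());
                }
            }
        }
        return numbers;
    }

    // Numbering of the merged segment: fields in order of first appearance
    // while walking the source segments in merge order.
    static Map<String, Integer> mergedNumbering(List<Map<String, Integer>> segments) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map<String, Integer> seg : segments) {
            for (String field : seg.keySet()) {
                if (!merged.containsKey(field)) {
                    merged.put(field, merged.size());
                }
            }
        }
        return merged;
    }

    // Bulk copy is only safe if every field keeps its number after the merge.
    static boolean canBulkCopy(Map<String, Integer> seg, Map<String, Integer> merged) {
        for (Map.Entry<String, Integer> e : seg.entrySet()) {
            if (!e.getValue().equals(merged.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = numberFields(Arrays.asList(Arrays.asList("title", "body")));
        Map<String, Integer> b = numberFields(Arrays.asList(Arrays.asList("body", "title")));
        Map<String, Integer> merged = mergedNumbering(Arrays.asList(a, b));
        System.out.println(canBulkCopy(a, merged) + " " + canBulkCopy(b, merged));
    }
}
```

In this model two segments holding the same fields, added in a
different order, already disagree on numbering, so the second one falls
off the fast path.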

Other things we've talked about, but with no issues opened yet:

  * Down prioritize all IO associated w/ merging.  Java/OS doesn't
    give us good support for this so I think we'd have to somehow
    emulate in Lucene, at the Directory level.

  * Don't let the IO from merging wipe the OS's IO cache.  For this we
    need to access madvise/posix_fadvise, which we don't have from
    javaland, so I think we'd need an OS dependent, optional JNI
    extension to do this.
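
For the first idea (down-prioritizing merge IO), one way to emulate it
in Lucene would be to pace the bytes written by merges.  The sketch
below is only the pacing arithmetic, assuming a fixed byte-per-second
budget; the class is hypothetical, and how a Directory wrapper would
tell merge writes apart from other writes is left open.

```java
// Hypothetical pacing helper: tells a writer how long to pause so that
// the cumulative write rate stays under a fixed budget.
class MergeIoThrottleSketch {
    private final double maxBytesPerSec;
    private final long startNanos;
    private long bytesSoFar;

    MergeIoThrottleSketch(double maxBytesPerSec, long startNanos) {
        this.maxBytesPerSec = maxBytesPerSec;
        this.startNanos = startNanos;
    }

    // Record that `bytes` were written at time `nowNanos` and return how many
    // nanoseconds the caller should sleep before writing more (0 if on budget).
    long pauseNanos(long bytes, long nowNanos) {
        bytesSoFar += bytes;
        // Earliest time at which bytesSoFar would be within the budget.
        long targetNanos = startNanos + (long) (bytesSoFar / maxBytesPerSec * 1e9);
        return Math.max(0L, targetNanos - nowNanos);
    }

    public static void main(String[] args) {
        MergeIoThrottleSketch t = new MergeIoThrottleSketch(1_000_000.0, 0L);
        System.out.println(t.pauseNanos(500_000L, 0L));
    }
}
```

The caller passes in the clock reading (for example System.nanoTime())
and does the actual sleeping, which keeps the arithmetic easy to test.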

Mike

On Thu, Jul 30, 2009 at 10:56 AM, Shai Erera<ser...@gmail.com> wrote:
> I think that when LUCENE-1750 is finished, you will be able to:
>
> 1) Create a MergePolicy that limits the segments size it's about to merge to
> a certain size.
> 2) Then have a daemon or something that runs on "idle" times and call
> optimize(maxNumSegments), or even open a new writer w/ the default merge
> policy and allow it to merge?
>
> Shai
>
> On Thu, Jul 30, 2009 at 5:48 PM, Grant Ingersoll <gsing...@apache.org>
> wrote:
>>
>> Note also response from Mike that talks a little bit about something along
>> these lines:
>> http://www.lucidimagination.com/search/document/fa990adba4d2572b/is_there_a_way_to_control_when_merges_happen#f6f0bfeef4bf9a39
>>
>> -Grant
>>
>> On Jul 30, 2009, at 10:35 AM, Grant Ingersoll wrote:
>>
>>> Given a large segment and a bunch of small segments, how does the
>>> ConcurrentMergeScheduler (CMS) work?  Does it always merge the smaller
>>> segments into the bigger one, or does it merge the smaller segments
>>> together?
>>>
>>> Something I've been thinking about:  Given a high update environment (and
>>> near real time, less than 1 minute, search constraints) and/or a very bursty
>>> environment, we've always said to keep the merge factor small for search
>>> reasons, at least in the high-update case.  However, I've seen a couple of
>>> times where this causes problems because merges can take over and cause
>>> pauses, even with CMS, so I am wondering if it makes sense to have a larger
>>> merge factor (>10), knowing that I may have a few large segments and then a
>>> bunch of small ones and that the CMS will, in the background, be able to
>>> keep merging the smaller segments together and in most cases avoid ever
>>> having to merge into the large segments (b/c maybe I can just optimize down
>>> at slower times or even merge larger segments later. )   Seems like this
>>> would allow one to make sure larger merges need not take place, or at least
>>> reduce the chances of that happening.
>>>
>>> Not sure if I worded that correctly.
>>>
>>> Thanks,
>>> Grant
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>
>>
>>
>
>

