Merge selection (LogMergePolicy) tries to merge "roughly" equal-sized segments (measured in bytes) together, so it produces a "roughly" log-staircase pattern of segment sizes.
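To make the staircase concrete, here is a minimal sketch in plain Java. It is a simplified model under stated assumptions, not Lucene's actual LogMergePolicy code: the class and method names (LogStaircaseSketch, level, simulate, mergeTail) are hypothetical, every flush is assumed to produce a segment of the same size, and a merge fires only when the newest mergeFactor segments all sit in the same size level.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of log-style merge selection. NOT Lucene code.
class LogStaircaseSketch {

    // Level of a segment: how many times its size can be integer-divided
    // by mergeFactor before it drops below mergeFactor.
    static int level(long sizeBytes, int mergeFactor) {
        int level = 0;
        while (sizeBytes >= mergeFactor) {
            sizeBytes /= mergeFactor;
            level++;
        }
        return level;
    }

    // Flush `flushes` segments of `flushSize` bytes each, cascading merges
    // after every flush. Returns the surviving segment sizes, oldest first.
    static List<Long> simulate(int flushes, long flushSize, int mergeFactor) {
        List<Long> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(flushSize);
            while (mergeTail(segments, mergeFactor)) {
                // keep merging while the newest segments share a level
            }
        }
        return segments;
    }

    // Merge the newest mergeFactor segments into one if they are all in the
    // same level. Returns true if a merge happened.
    static boolean mergeTail(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        int tailLevel = level(segments.get(n - 1), mergeFactor);
        long merged = 0;
        for (int i = n - mergeFactor; i < n; i++) {
            if (level(segments.get(i), mergeFactor) != tailLevel) {
                return false;
            }
            merged += segments.get(i);
        }
        segments.subList(n - mergeFactor, n).clear();
        segments.add(merged);
        return true;
    }

    public static void main(String[] args) {
        // 13 unit-sized flushes with mergeFactor 3
        System.out.println(simulate(13, 1L, 3)); // prints [9, 3, 1]
    }
}
```

In this toy model the small segments only ever merge with each other; a big segment is touched again only once enough same-level neighbors have accumulated next to it, which is what gives the staircase its shape.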

I agree, in an NRT app, larger mergeFactor is likely best since it minimizes reopen time overall. It's also important to setMergedSegmentWarmer so a newly merged segment is warmed (in the background, with CMS) before being returned in a reopened NRT reader. And making a custom MergeScheduler that defers big merges until "after hours" should work well too...

On the impact of search performance for large vs small mergeFactors, I think the jury is still out. People should keep testing that (and report back!). Certainly, for the fastest reopen time you never want any merging to be done :)

I think there are a number of good merge improvements in flight right now:

  * LUCENE-1750: limiting the max size of the merged segment

  * LUCENE-1076: allow merge policy to select non-contiguous segments

  * LUCENE-1737: always bulk-copy when merging -- the bulk copy optimization makes merging the doc stores much faster now, but it's a brittle optimization since it's sensitive to exactly which fields, and in what order, you add to your docs

Other things we've talked about but no issues yet:

  * Down-prioritize all IO associated w/ merging. Java/OS doesn't give us good support for this so I think we'd have to somehow emulate in Lucene, at the Directory level.

  * Don't let the IO from merging wipe the OS's IO cache. For this we need to access madvise/posix_fadvise, which we don't have from javaland, so I think we'd need an OS-dependent, optional JNI extension to do this.

Mike

On Thu, Jul 30, 2009 at 10:56 AM, Shai Erera<ser...@gmail.com> wrote:
> I think that when LUCENE-1750 is finished, you will be able to:
>
> 1) Create a MergePolicy that limits the segments size it's about to merge to
> a certain size.
> 2) Then have a daemon or something that runs on "idle" times and call
> optimize(maxNumSegments), or even open a new writer w/ the default merge
> policy and allow it to merge?
>
> Shai
>
> On Thu, Jul 30, 2009 at 5:48 PM, Grant Ingersoll <gsing...@apache.org>
> wrote:
>>
>> Note also the response from Mike that talks a little bit about something
>> along these lines:
>> http://www.lucidimagination.com/search/document/fa990adba4d2572b/is_there_a_way_to_control_when_merges_happen#f6f0bfeef4bf9a39
>>
>> -Grant
>>
>> On Jul 30, 2009, at 10:35 AM, Grant Ingersoll wrote:
>>
>>> Given a large segment and a bunch of small segments, how does the
>>> ConcurrentMergeScheduler (CMS) work? Does it always merge the smaller
>>> segments into the bigger one, or does it merge the smaller segments
>>> together?
>>>
>>> Something I've been thinking about: Given a high update environment (and
>>> near real time, less than 1 minute, search constraints) and/or a very bursty
>>> environment, we've always said to keep the merge factor small for search
>>> reasons, at least in the high-update case. However, I've seen a couple of
>>> times where this causes problems because merges can take over and cause
>>> pauses, even with CMS, so I am wondering if it makes sense to have a larger
>>> merge factor (>10), knowing that I may have a few large segments and then a
>>> bunch of small ones and that the CMS will, in the background, be able to
>>> keep merging the smaller segments together and in most cases avoid ever
>>> having to merge into the large segments (b/c maybe I can just optimize down
>>> at slower times or even merge larger segments later). Seems like this
>>> would allow one to make sure larger merges need not take place, or at least
>>> reduce the chances of that happening.
>>>
>>> Not sure if I worded that correctly.
>>>
>>> Thanks,
>>> Grant
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org