[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493065 ]

Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Following up on this, it's basically the idea that segments ought to be
> created/merged either by-segment-size or by-doc-count, but not by a
> mixture? That wouldn't be surprising ...

Right, but we need the refactored merge policy framework in place
first.  I'll mark this issue dependent on LUCENE-847.

> It does impact the APIs, though. It's easy enough to imagine, with factored 
> merge policies, both by-doc-count and by-segment-size policies. But the 
> initial segment creation is going to be handled by IndexWriter, so you have 
> to manually make sure you don't set that algorithm and the merge policy in 
> conflict. Not great, but I don't have any great ideas. Could put in an API 
> handshake, but I'm not sure if it's worth the mess?

Good question.  I think it's OK (at least for our first go at this --
progress not perfection!) to expect the developer to choose a merge
policy and then to use IndexWriter in a way that's "consistent" with
that merge policy?  I think it's going to get too complex if we try to
formally couple "when to flush/commit" with the merge policy?

But, I think the default merge policy needs to be resilient to people
doing things like changing maxBufferedDocs/mergeFactor partway through
an index, calling flush() whenever they want, etc.  The merge policy
today is not resilient to these "normal" usages of IndexWriter.  So I
think we need to do something here even without the pressure from
LUCENE-843.
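
For instance, here is a minimal sketch, using the calls mentioned in
this issue, of the kind of "normal" flush-by-RAM usage that today's
policy mishandles (the 32 MB budget and the huge maxBufferedDocs are
made-up values):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;

  static void indexByRam(Directory dir, Analyzer analyzer,
                         Iterable<Document> docs) throws java.io.IOException {
    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    writer.setMaxBufferedDocs(1000000);  // effectively disable flush-by-doc-count
    for (Document doc : docs) {
      writer.addDocument(doc);
      // flush whenever buffered RAM crosses the budget, not by doc count:
      if (writer.ramSizeInBytes() > 32 * 1024 * 1024)
        writer.flush();
    }
    writer.close();
  }

Every segment this produces stays far below the (huge) maxBufferedDocs,
which is exactly what confuses the level computation described in the
issue summary below.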

> Also, it sounds like, so far, there's no good way of managing parallel-reader 
> setups w/by-segment-size algorithms, since the algorithm for creating/merging 
> segments has to be globally consistent, not just per index, right?

Right.  We clearly need to keep the current "by doc" merge policy
easily available for this use case.

> If that is right, what does that say about making by-segment-size the
> default? It's gonna break (as in bad results) people who rely on that
> behavior and don't change their code. Is there a community consensus on
> this? It's not really an API change that would cause a compile/class-load
> failure, but in some ways, it's worse ...

I think there are actually two questions here:

  1) What exactly makes for a good default merge policy?

     I think the merge policy we have today has some limitations:

       - It's not resilient to "normal" usage of the public APIs in
         IndexWriter.  If you call flush() yourself, if you change
         maxBufferedDocs (and maybe mergeFactor?) partway through an
         index, etc., you can cause disastrous amounts of over-merging
         (that's this issue; see the sketch after this list).

         I think the default policy should be entirely resilient to
         any usage of the public IndexWriter APIs.

       - Default merge policy should strive to minimize net cost
         (amortized over time) of merging, but the current one
         doesn't:

         - When docs differ in size (frequently the case) it will be
           too costly in CPU/IO consumption because small segments are
           merged with large ones.

         - It does too much work in advance (too much "pay it
           forward").  I don't think a merge policy should
           "inadvertently optimize" (I opened LUCENE-854 to describe
           this).

       - It blocks LUCENE-843 (flushing by RAM usage).

         I think Lucene "out of the box" should give you good indexing
         performance.  You should not have to do extra tuning to get
         substantially better performance.  The best way to get that
         is to "flush by RAM" (with LUCENE-843), but the current merge
         policy prevents this (due to this issue).

  2) Can we change the default merge policy?

     I sure hope so, given the issues above.

     I think the majority of Lucene users do the simple "create a
     writer, add/delete docs, close writer, while reader(s) use the
     same index" type of usage and so would benefit from the
     performance gains of LUCENE-843 and LUCENE-854.

     I think (but may be wrong!) it's a minority who use
     ParallelReader and therefore have a reliance on the specific
     merge policy we use today?
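
To make the resilience problem above concrete, here is a hypothetical,
much-simplified model of how today's policy buckets segments into
levels (the real logic lives inside IndexWriter and differs in detail,
but the doc-count bucketing is the point):

  // Hypothetical simplification: a segment's level is inferred purely
  // from its doc count relative to maxBufferedDocs and mergeFactor.
  // If you flush by RAM with a huge maxBufferedDocs, every segment --
  // freshly flushed or already merged -- stays at level 0, so the
  // policy keeps re-merging the same ever-larger segments.
  static int level(int docCount, int maxBufferedDocs, int mergeFactor) {
    int level = 0;
    long upperBound = maxBufferedDocs;   // level 0: <= maxBufferedDocs docs
    while (docCount > upperBound) {
      upperBound *= mergeFactor;         // level n: <= maxBufferedDocs * mergeFactor^n
      level++;
    }
    return level;
  }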

Ideally, we would first commit the "decouple merge policy from
IndexWriter" change (LUCENE-847), then make a new merge policy that
resolves this issue and LUCENE-854, and make it the default.  I think
this policy would look at the size (in bytes) of each segment (perhaps
proportionally reducing the byte count according to pending deletes
against that segment) and would merge any adjacent segments (not just
the rightmost ones) that are "the most similar" in size.  I think it
would merge N (configurable) segments at a time and at no time would
inadvertently optimize.
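
As a rough sketch of the selection step such a policy might use
(hypothetical helper, not a real API; it just illustrates "merge the N
adjacent segments most similar in size"):

  // Hypothetical helper: given net segment sizes in index order
  // (bytes, already discounted for pending deletes), return the start
  // of the window of n adjacent segments whose sizes are most similar,
  // or -1 if there are fewer than n segments.
  static int findMostSimilarRun(long[] sizes, int n) {
    int bestStart = -1;
    double bestRatio = Double.MAX_VALUE;
    for (int i = 0; i + n <= sizes.length; i++) {
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (int j = i; j < i + n; j++) {
        min = Math.min(min, sizes[j]);
        max = Math.max(max, sizes[j]);
      }
      double ratio = (double) max / Math.max(1, min);  // 1.0 == identical sizes
      if (ratio < bestRatio) {
        bestRatio = ratio;
        bestStart = i;
      }
    }
    return bestStart;
  }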

This would mean that users of ParallelReader, on upgrading, would need
to switch their merge policy to the legacy "merge by doc count"
policy.
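
For example, assuming the LUCENE-847 refactoring ends up exposing
something along these lines (both names are guesses):

  // Hypothetical post-LUCENE-847 API; both names are guesses:
  writer.setMergePolicy(new LogDocMergePolicy());  // legacy by-doc-count behavior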

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
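
To make that workaround concrete (illustrative numbers only):

  mergeFactor                = 10
  typical docs per RAM flush ~ 200          (depends on doc sizes and RAM budget)
  safe maxBufferedDocs       < 10 * 200 = 2000

i.e., keep maxBufferedDocs below 2000 so that freshly flushed segments
still register at the lowest level.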
