[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-845:
--------------------------------------

    Attachment: LUCENE-845.patch


First cut patch.  You have to first apply the most recent patch from
LUCENE-847:

  https://issues.apache.org/jira/secure/attachment/12363880/LUCENE-847.patch.txt

and then apply this patch over it.

This patch has two merge policies:

  LogDocMergePolicy

    This is "backwards compatible" with the current merge policy, yet
    resolves the "over-merge" issue by not using the current setting
    of "maxBufferedDocs" when computing levels.  I think it should
    replace the current LogDocMergePolicy from LUCENE-847.

  LogByteSizeMergePolicy

    Chooses merges according to the net size in bytes of all files in
    a segment.  I think we should make this one the default merge
    policy, and also change IndexWriter to flush by RAM usage by
    default.

They both subclass the abstract base LogMergePolicy and differ only
in the "size" method, which defines how a segment's size is measured
(number of docs in the segment, or net size in bytes of the segment).
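
For concreteness, here is a minimal sketch of that shape (illustrative
only: the signatures, and SegmentInfo's docCount field and
sizeInBytes() method, are assumptions based on the Lucene 2.x API,
not necessarily what the patch does):

  import java.io.IOException;
  import org.apache.lucene.index.SegmentInfo;

  abstract class LogMergePolicy {
    // How "big" a segment is; each subclass picks the unit.
    protected abstract long size(SegmentInfo info) throws IOException;
  }

  class LogDocMergePolicy extends LogMergePolicy {
    protected long size(SegmentInfo info) {
      return info.docCount;       // size = doc count of the segment
    }
  }

  class LogByteSizeMergePolicy extends LogMergePolicy {
    protected long size(SegmentInfo info) throws IOException {
      return info.sizeInBytes();  // size = net bytes of the segment's files
    }
  }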

The gist of the approach is the same as the current merge policy: you
generally try to merge segments that are "roughly" the same size
(where size can be doc count or byte size), mergeFactor at a time.

The big difference is that instead of starting from maxBufferedDocs
and "going up" to determine level, I start from the max segment size
(across all segments in the index) and "go down" to determine level.
This resolves the bug because levels are "self-defined" by the
segments, rather than by the current value of maxBufferedDocs on
IndexWriter.

I then pick merges exactly the same as the current merge policy: if
any level has >= mergeFactor segments, we merge them.
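
In sketch form the level computation looks like this (a simplified
illustration of the idea, not code from the patch; the real policy
also has to handle level boundaries and ties):

  // Each segment's level is how many powers of mergeFactor it sits
  // below the largest segment in the index, so segments of roughly
  // equal size land on the same level regardless of the current
  // value of maxBufferedDocs.
  static int[] computeLevels(long[] sizes, int mergeFactor) {
    long maxSize = 1;
    for (int i = 0; i < sizes.length; i++)
      maxSize = Math.max(maxSize, sizes[i]);
    double norm = Math.log(mergeFactor);
    int[] levels = new int[sizes.length];
    for (int i = 0; i < sizes.length; i++) {
      long size = Math.max(1, sizes[i]);
      // 0 = the largest segment's level; +1 per factor of mergeFactor
      // smaller
      levels[i] = (int) Math.floor(Math.log((double) maxSize / size) / norm);
    }
    return levels;
  }

So with mergeFactor 10, any segment within 10X of the largest is on
level 0, anything 10X to 100X smaller is on level 1, and so on; as
soon as some level holds mergeFactor segments, they get merged.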

All tests pass, except:

  * One assert in testAddIndexesNoOptimize, which relied on specific
    invariants of the current merge policy (it's the same assert that
    LUCENE-847 had changed; it tests particular corner cases of the
    current merge policy).  Changing the assertEquals expected value
    from "3" to "4" fixes it.

  * TestLogDocMergePolicy (added in LUCENE-847) doesn't compile
    against the new version because it uses methods that don't exist
    in the new LogDocMergePolicy.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at the net size (in
> bytes) of a segment and "infer" its level from that?  Still, we would
> have to be resilient to the application suddenly increasing the
> allowed RAM.
> The good news is that to work around this bug, I think you just need
> to ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
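
For reference, the flush-by-RAM pattern described above looks roughly
like this (a sketch: ramSizeInBytes() and flush() are the IndexWriter
calls named in the description, while FlushByRam, indexAll, docs and
maxRam are placeholder names):

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  class FlushByRam {
    static void indexAll(IndexWriter writer, List docs, long maxRam)
        throws IOException {
      for (Iterator it = docs.iterator(); it.hasNext();) {
        writer.addDocument((Document) it.next());
        if (writer.ramSizeInBytes() > maxRam)
          writer.flush();  // flush buffered docs as a new level-0 segment
      }
    }
  }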
