[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526403 ]

Michael McCandless commented on LUCENE-845:
-------------------------------------------

In the latest patch on LUCENE-847 I've added methods to
LogDocMergePolicy (setMinMergeDocs) and LogByteSizeMergePolicy
(setMinMergeMB) to set a floor on the segment levels such that all
segments below this min size are aggressively merged as if they were in
one level.  This effectively "truncates" what would otherwise be a
long tail of segment sizes, when you are flushing many tiny segments
into your index.
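
To make the new knobs concrete, here is a rough sketch of how an
application would set these floors, assuming the setMergePolicy /
setMinMergeMB / setMinMergeDocs APIs from the latest LUCENE-847 patch
(the index path and the floor values below are just placeholders):

  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.LogByteSizeMergePolicy;

  public class MinMergeFloorExample {
    public static void main(String[] args) throws Exception {
      // "/path/to/index" is a placeholder; create a new index there.
      IndexWriter writer = new IndexWriter("/path/to/index",
                                           new SimpleAnalyzer(), true);

      // Floor by size: all segments below the floor are treated as one
      // level and get merged together aggressively.
      LogByteSizeMergePolicy bySize = new LogByteSizeMergePolicy();
      bySize.setMinMergeMB(2.0);              // placeholder floor
      writer.setMergePolicy(bySize);

      // Or floor by doc count instead:
      //   LogDocMergePolicy byDocs = new LogDocMergePolicy();
      //   byDocs.setMinMergeDocs(1000);
      //   writer.setMergePolicy(byDocs);

      writer.close();
    }
  }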

In order to pick reasonable defaults for the min segment size, I ran
some benchmarks to measure the indexing cost of truncating the tail.

I processed Wiki content into ~4 KB plain text documents and then
indexed the first 10,000 docs using this alg:

  analyzer=org.apache.lucene.analysis.SimpleAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  directory=FSDirectory
  docs.file=/lucene/wiki4K.txt
  max.buffered = 500

  ResetSystemErase
  CreateIndex
  {AddDoc >: 10000
  CloseIndex

  RepSumByName

I'm using the SerialMergeScheduler.

I modified contrib/benchmark to always flush a new segment after each
added document: this simulates the "worst case" of tiny segments, i.e.,
lowest-latency indexing where every added doc must immediately be
visible to searchers.
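
The effect of that modification is roughly equivalent to this loop in
user code (a sketch only; the document built here is a stand-in for the
~4 KB wiki docs, and flush() is the public flush discussed on this
issue):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Sketch: simulates lowest-latency indexing by flushing every added
  // document out as its own tiny segment.
  class PerDocFlush {
    static void indexWithPerDocFlush(IndexWriter writer, int numDocs)
        throws IOException {
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();    // stand-in for the ~4 KB wiki doc
        doc.add(new Field("body", "doc " + i,
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.flush();                   // forces a one-doc segment to disk
      }
    }
  }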

Each time is the best of 2 runs.  This was run on Linux (2.6.22.1), on
a Core 2 Duo 2.4 GHz machine with 4 GB RAM and a RAID 5 IO system,
using Java 1.5 -server.

    maxBufferedDocs   seconds    slowdown
    10                40         1.0
    100               50         1.3
    200               59         1.5
    300               64         1.6
    400               72         1.8
    500               80         2.0
    750               97         2.4
   1000              114         2.9
   1500              138         3.5
   2000              169         4.2
   3000              205         5.1
   4000              264         6.6
   5000              320         8.0
   7500              404        10.1
  10000              645        16.1

Here's my thinking:

  * If you are flushing zillions of such tiny segments, I think it's
    OK to accept a sizable net slowdown of your overall indexing
    speed.  I'll use a 4X slowdown "tolerance" to pick the default
    values; this corresponds roughly to the "2000" line above.
    However, because I tested on a fairly fast CPU & IO system, I'll
    multiply the numbers by 0.5.

  * Given this, I propose we default minMergeMB
    (LogByteSizeMergePolicy) to 1.6 MB (the average size of the real
    segments at the 2000 point above was 3.2 MB, times the 0.5 factor)
    and default minMergeDocs (LogDocMergePolicy) to 1000 (2000 times
    0.5).

  * Note that when you are flushing large segments (larger than these
    min size settings) then there is no slowdown at all because the
    flushed segments are already above the minimum size.

These are just defaults, so a given application can always change
their "min merge size" as needed.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
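
For reference, the "flush by RAM usage" loop the description refers to
looks roughly like this (a sketch; getNextDoc() is a hypothetical
document source, and the 32 MB budget is an arbitrary example):

  // Sketch of flushing by RAM usage rather than by maxBufferedDocs.
  long ramBudget = 32 * 1024 * 1024;        // arbitrary 32 MB budget
  Document doc;
  while ((doc = getNextDoc()) != null) {    // hypothetical doc source
    writer.addDocument(doc);
    if (writer.ramSizeInBytes() >= ramBudget)
      writer.flush();
  }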

