[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520611 ]

Yonik Seeley commented on LUCENE-845:
-------------------------------------

Merging small segments in the reader seems like a cool idea on its own.
But if it's an acceptable hit to merge in the reader, why isn't it acceptable
in the writer?

Think about a writer flushing 10 small segments, with a new reader opened after
each flush: each reader re-merges the whole tail of small segments, so across
the 10 opens the readers do roughly 1 + 2 + ... + 10, i.e. ~10*10/2, segment
merges.  If the writer did the merging instead, it would only need to merge
those ~10 segments once.
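
Just to make that arithmetic concrete, here's a tiny sketch of the N=10 case
(the class and the loop are purely illustrative, not anything in Lucene):

    // Rough cost comparison: a new reader opened after each of N small-segment
    // flushes re-merges the whole tail on every open, touching 1 + 2 + ... + N
    // (~N*N/2) segments in total; a writer that merged the tail once touches ~N.
    public class MergeCost {
      public static void main(String[] args) {
        int n = 10;                       // number of small segments flushed
        int readerWork = 0;
        for (int flushes = 1; flushes <= n; flushes++) {
          readerWork += flushes;          // each new reader re-merges every small segment so far
        }
        System.out.println("reader-side segment merges: " + readerWork);  // 55, i.e. ~10*10/2
        System.out.println("writer-side segment merges: " + n);           // ~10
      }
    }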

Thinking about it another way... if there were no separation between reader and
writer, and small segments were merged on an open, why not just write out the 
result so it wouldn't have to be done again?  Now move "merge on an open" to 
"merge on the close" and that's what IndexWriter currently does.  Why is it OK 
for a reader to pay the price but not the writer?

Also, would this tail merging on an open be able to reduce the peak number of
file descriptors?  It seems like, to do so, the tail would have to be merged
*before* the other index files were opened, which further complicates matters.


> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
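
For concreteness, here's a minimal sketch of the "flush by RAM usage" loop the
description above talks about, using the two methods it names
(writer.ramSizeInBytes() and writer.flush()).  The index path, field, RAM
budget, mergeFactor and maxBufferedDocs values are illustrative assumptions,
not part of the issue:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FlushByRam {
      public static void main(String[] args) throws Exception {
        long ramBudget = 48 * 1024 * 1024;   // flush once buffered docs use ~48 MB (made up)
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

        // Workaround from the description: keep maxBufferedDocs above the typical
        // number of docs per RAM-triggered flush (so the RAM check fires first) but
        // below mergeFactor * that number.  These values are made up for the sketch.
        writer.setMergeFactor(10);
        writer.setMaxBufferedDocs(50000);

        for (int i = 0; i < 100000; i++) {
          Document doc = new Document();
          doc.add(new Field("body", "text for doc " + i,
                            Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);
          if (writer.ramSizeInBytes() > ramBudget) {
            writer.flush();                // flush whenever RAM usage crosses the budget
          }
        }
        writer.close();
      }
    }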

