[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520649 ]
Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Merging small segments in the reader seems like a cool idea on its
> own.  But if it's an acceptable hit to merge in the reader, why is
> it not in the writer?

Good point.  I think it comes down to how often we expect readers to
refresh vs writers to flush.  If indeed it's 1 to 1 (as the truest
"low latency" app would in fact be, or a "single writer + reader with
no separation"), then the writer should merge them: although it pays
an O(N^2) cost to keep the tail "short", merging on open would cost
even more.  But if the writer flushes frequently and the reader
re-opens less frequently, then it's better to merge on open.

Of course, if the O(N^2) cost for IndexWriter to keep a short tail is
in practice not too costly, then we should just leave this in
IndexWriter.  I still need to run that test for LUCENE-845.

> Also, would this tail merging on an open be able to reduce the peak
> number of file descriptors?  It seems like to do so, the tail would
> have to be merged *before* other index files were opened, further
> complicating matters.

Right, I think to keep peak descriptor usage capped we must merge the
tail first, then open the remaining segments, which definitely
complicates things...

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
>
> But this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.  This is because the merge policy
> looks at the current maxBufferedDocs to figure out which segments are
> level 0 (first flushed) or level 1 (merged from <mergeFactor> level 0
> segments).
>
> I'm not sure how to fix this.  Maybe we can look at the net size
> (bytes) of a segment and "infer" its level from that?  Still, we
> would have to be resilient to the application suddenly increasing the
> RAM allowed.
>
> The good news is that to work around this bug you just need to ensure
> that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
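For illustration only, here is a minimal sketch of that workaround,
using the writer.flush() / writer.ramSizeInBytes() calls mentioned in
the description.  The class name, the 48 MB budget, and the assumption
of roughly 5,000 docs per RAM-triggered flush are all made up; this is
not part of the attached patch:

    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class FlushByRamSketch {

      // Made-up RAM budget: flush whenever buffered docs exceed it.
      private static final long MAX_RAM_BYTES = 48L * 1024 * 1024;

      public static void index(Directory dir, List<Document> docs) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Workaround from the issue: assuming ~5,000 docs per RAM-triggered
        // flush and mergeFactor 10, pick maxBufferedDocs above the typical
        // flush size (so RAM, not doc count, triggers the flush) but below
        // mergeFactor * typical-number-of-docs-flushed (so the merge policy's
        // level inference is not confused).
        writer.setMergeFactor(10);
        writer.setMaxBufferedDocs(10000);

        for (Document doc : docs) {
          writer.addDocument(doc);
          if (writer.ramSizeInBytes() > MAX_RAM_BYTES) {
            writer.flush();  // flush by RAM usage, as described above
          }
        }
        writer.close();
      }
    }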