[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493065 ]
Michael McCandless commented on LUCENE-845:
-------------------------------------------

> Following up on this, it's basically the idea that segments ought to be
> created/merged either by-segment-size or by-doc-count but not by a
> mixture? That wouldn't be surprising ...

Right, but we need the refactored merge policy framework in place first.
I'll mark this issue dependent on LUCENE-847.

> It does impact the APIs, though. It's easy enough to imagine, with factored
> merge policies, both by-doc-count and by-segment-size policies. But the
> initial segment creation is going to be handled by IndexWriter, so you have
> to manually make sure you don't set that algorithm and the merge policy in
> conflict. Not great, but I don't have any great ideas. Could put in an API
> handshake, but I'm not sure it's worth the mess?

Good question. I think it's OK (at least for our first go at this --
progress not perfection!) to expect the developer to choose a merge policy
and then to use IndexWriter in a way that's "consistent" with that merge
policy. I think it would get too complex if we tried to formally couple
"when to flush/commit" with the merge policy.

But I think the default merge policy needs to be resilient to people doing
things like changing maxBufferedDocs/mergeFactor partway through an index,
calling flush() whenever they want, etc. The merge policy today is not
resilient to these "normal" usages of IndexWriter. So I think we need to
do something here even without the pressure from LUCENE-843.

> Also, it sounds like, so far, there's no good way of managing parallel-reader
> setups w/ by-segment-size algorithms, since the algorithm for creating/merging
> segments has to be globally consistent, not just per index, right?

Right. We clearly need to keep the current "by doc count" merge policy
easily available for this use case.

> If that is right, what does that say about making by-segment-size the
> default? It's gonna break (as in bad results) people relying on that
> behavior who don't change their code. Is there a community consensus on
> this? It's not really an API change that would cause a compile/class-load
> failure, but in some ways, it's worse ...

I think there are actually two questions here:

1) What exactly makes for a good default merge policy?

I think the merge policy we have today has some limitations:

  - It's not resilient to "normal" usage of the public IndexWriter APIs.
    If you call flush() yourself, or change maxBufferedDocs (and maybe
    mergeFactor?) partway through an index, etc., you can cause disastrous
    amounts of over-merging (that's this issue). I think the default policy
    should be entirely resilient to any usage of the public IndexWriter
    APIs.

  - The default merge policy should strive to minimize the net cost
    (amortized over time) of merging, but the current one doesn't:

      - When docs differ in size (frequently the case), it is too costly
        in CPU/IO consumption because small segments get merged with large
        ones.

      - It does too much work in advance (too much "pay it forward"). I
        don't think a merge policy should "inadvertently optimize" (I
        opened LUCENE-854 to describe this).

  - It blocks LUCENE-843 (flushing by RAM usage). I think Lucene "out of
    the box" should give you good indexing performance; you should not
    have to do extra tuning to get substantially better performance. The
    best way to get that is to "flush by RAM" (with LUCENE-843), but the
    current merge policy prevents this (due to this issue).

2) Can we change the default merge policy?

I sure hope so, given the issues above. I think the majority of Lucene
users do the simple "create a writer, add/delete docs, close writer, while
reader(s) use the same index" type of usage and so would benefit from the
performance gains of LUCENE-843 and LUCENE-854. I think (but may be
wrong!) it's a minority who use ParallelReader and therefore rely on the
specific merge policy we use today.

Ideally we would first commit the "decouple merge policy from IndexWriter"
work (LUCENE-847), then make a new merge policy that resolves this issue
and LUCENE-854, and make it the default policy.

I think this policy would look at the size (in bytes) of each segment
(perhaps proportionally reducing the byte count according to pending
deletes against that segment), and would merge any adjacent segments (not
just the rightmost ones) that are "the most similar" in size. It would
merge N (configurable) segments at a time and at no point would it
inadvertently optimize. A rough sketch of the selection logic appears
below.

This would mean that users of ParallelReader, upon upgrading, would need
to switch their merge policy to the legacy "merge by doc count" policy.
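To make that concrete, here is a rough, untested sketch of the selection
logic. All of the names here (SizeMergePolicySketch, SegmentStats,
findMerge, maxSkew) are made up for illustration; the real API would fall
out of LUCENE-847:

    import java.util.List;

    /**
     * Illustrative sketch only, not a real Lucene API: pick the window of
     * n adjacent segments whose deletion-adjusted byte sizes are the most
     * similar, and merge nothing else.
     */
    public class SizeMergePolicySketch {

      /** Minimal stand-in for per-segment stats. */
      public static class SegmentStats {
        final long sizeInBytes;
        final double pctDeleted;   // fraction of docs pending delete, 0.0-1.0

        public SegmentStats(long sizeInBytes, double pctDeleted) {
          this.sizeInBytes = sizeInBytes;
          this.pctDeleted = pctDeleted;
        }

        /** Byte size proportionally reduced by pending deletes. */
        long effectiveSize() {
          return (long) (sizeInBytes * (1.0 - pctDeleted));
        }
      }

      /**
       * Returns the start index of the most size-similar window of n
       * adjacent segments, or -1 if no window is similar enough to merge.
       * Any adjacent window qualifies, not just the rightmost one, and
       * declining to merge dissimilar segments is what keeps the policy
       * from "inadvertently optimizing".
       */
      public static int findMerge(List<SegmentStats> segments, int n,
                                  double maxSkew) {
        int best = -1;
        double bestSkew = maxSkew;   // only accept windows under this skew
        for (int i = 0; i + n <= segments.size(); i++) {
          long min = Long.MAX_VALUE;
          long max = 0;
          for (int j = i; j < i + n; j++) {
            long size = segments.get(j).effectiveSize();
            min = Math.min(min, size);
            max = Math.max(max, size);
          }
          // skew == 1.0 means the n segments are identical in size
          double skew = ((double) max) / Math.max(1, min);
          if (skew <= bestSkew) {
            bestSkew = skew;
            best = i;
          }
        }
        return best;
      }
    }

With n = mergeFactor = 10, say, this would merge the 10 adjacent segments
closest in size and simply do nothing when all windows are too lopsided,
rather than paying merge cost forward the way the current policy does.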
> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
>
> But this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
>
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
>
> I'm not sure how to fix this. Maybe we can look at the net size (in
> bytes) of a segment and "infer" its level from that? Still, we would
> have to be resilient to the application suddenly increasing the RAM
> allowed.
>
> The good news is that to work around this bug, I think you just need
> to ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
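For reference, the "flush by RAM" usage the description refers to looks
roughly like this. Only writer.flush() and writer.ramSizeInBytes() come
from the issue text above; the index path, RAM budget, and document setup
are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FlushByRam {
      public static void main(String[] args) throws Exception {
        // Illustrative budget: flush whenever buffered docs cross 32 MB.
        final long maxBufferedRam = 32 * 1024 * 1024;

        IndexWriter writer = new IndexWriter("/tmp/flush-by-ram-index",
                                             new StandardAnalyzer(), true);
        for (int i = 0; i < 1000000; i++) {
          Document doc = new Document();
          doc.add(new Field("body", "sample text for doc " + i,
                            Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);

          // Flush by RAM usage, as the description suggests. With the
          // current merge policy this is the call that can trigger
          // over-merging when maxBufferedDocs isn't set to match.
          if (writer.ramSizeInBytes() >= maxBufferedRam) {
            writer.flush();
          }
        }
        writer.close();
      }
    }

Per the workaround in the description, this stays safe as long as
maxBufferedDocs is less than mergeFactor times the typical number of docs
each flush produces.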