[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526403 ]
Michael McCandless commented on LUCENE-845:
-------------------------------------------

In the latest patch on LUCENE-847 I've added methods to LogDocMergePolicy
(setMinMergeDocs) and LogByteSizeMergePolicy (setMinMergeMB) to set a floor on
the segment levels, such that all segments below this min size are aggressively
merged as if they were in one level.  This effectively "truncates" what would
otherwise be a long tail of segment sizes when you are flushing many tiny
segments into your index.

In order to pick reasonable defaults for the min segment size, I ran some
benchmarks to measure the indexing cost of truncating the tail.  I processed
Wiki content into ~4 KB plain-text documents and then indexed the first
10,000 docs using this alg:

  analyzer=org.apache.lucene.analysis.SimpleAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  directory=FSDirectory
  docs.file=/lucene/wiki4K.txt
  max.buffered = 500

  ResetSystemErase
  CreateIndex
  {AddDoc >: 10000
  CloseIndex

  RepSumByName

I'm using the SerialMergeScheduler.  I modified contrib/benchmark to always
flush a new segment after each added document: this simulates the "worst case"
of tiny segments, i.e., lowest-latency indexing where every added doc must then
be visible to searchers.

Each time is the best of 2 runs.  This is run on a Linux (2.6.22.1) Core 2 Duo
2.4 GHz machine with 4 GB RAM and a RAID 5 IO system, using Java 1.5 -server.

  maxBufferedDocs   seconds   slowdown
               10        40        1.0
              100        50        1.3
              200        59        1.5
              300        64        1.6
              400        72        1.8
              500        80        2.0
              750        97        2.4
             1000       114        2.9
             1500       138        3.5
             2000       169        4.2
             3000       205        5.1
             4000       264        6.6
             5000       320        8.0
             7500       404       10.1
            10000       645       16.1

Here's my thinking:

  * If you are flushing zillions of such tiny segments, I think it's OK to
    accept a sizable net slowdown of your overall indexing speed.  I'll choose
    a 4X slowdown "tolerance" to pick default values; this corresponds roughly
    to the "2000" line above.  However, because I tested on a fairly fast CPU
    & IO system, I'll multiply the numbers by 0.5.

  * Given this, I propose we default minMergeMB (LogByteSizeMergePolicy) to
    1.6 MB (half of the 3.2 MB average size of the real segments at the 2000
    point above) and default minMergeDocs (LogDocMergePolicy) to 1000 (half of
    2000).

  * Note that when you are flushing large segments (larger than these min size
    settings) there is no slowdown at all, because the flushed segments are
    already above the minimum size.

These are just defaults, so a given application can always change its "min
merge size" as needed.
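For readers who want to try these floors, here is a minimal configuration
sketch.  It is written against the later IndexWriterConfig-based API rather
than the IndexWriter setters added by the LUCENE-847 patch, and the index
path, analyzer, and the 1.6 MB / 1000-doc values are simply the proposed
defaults made explicit, not values mandated by the patch:

  import java.nio.file.Paths;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.index.SerialMergeScheduler;
  import org.apache.lucene.store.FSDirectory;

  public class MinMergeFloorExample {
    public static void main(String[] args) throws Exception {
      // Segments smaller than 1.6 MB are treated as one level and merged
      // aggressively, truncating the long tail of tiny flushed segments.
      LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
      mp.setMinMergeMB(1.6);               // proposed default from the benchmark above
      // Doc-count equivalent (hypothetical alternative):
      //   LogDocMergePolicy mp = new LogDocMergePolicy();
      //   mp.setMinMergeDocs(1000);

      IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
      iwc.setMergePolicy(mp);
      iwc.setMergeScheduler(new SerialMergeScheduler());   // as in the benchmark

      try (IndexWriter writer = new IndexWriter(
               FSDirectory.open(Paths.get("/lucene/index")), iwc)) {
        // ... add documents; tiny flushed segments get merged up toward 1.6 MB
      }
    }
  }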
> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at the net size (bytes)
> of a segment and "infer" its level from that?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
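As a rough sketch of the pattern the quoted issue describes, the loop below
flushes by RAM usage with the 2.x-era method names the report itself mentions
(writer.flush(), writer.ramSizeInBytes()) and applies the suggested workaround.
The 48 MB budget, the 500-docs-per-flush estimate, the directory path, and the
analyzer are illustrative placeholders, not values from the issue:

  import java.util.List;

  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class FlushByRamExample {
    // Illustrative RAM budget; flush whenever buffered docs cross it.
    private static final long RAM_BUDGET = 48L * 1024 * 1024;

    public static void indexAll(List<Document> docs) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/lucene/index"), new SimpleAnalyzer(), true);

      int mergeFactor = 10;
      int typicalDocsPerFlush = 500;   // estimate for your docs at this RAM budget
      writer.setMergeFactor(mergeFactor);
      // Workaround from the issue: keep maxBufferedDocs below
      // mergeFactor * typical-number-of-docs-flushed so segment levels are
      // still inferred sensibly.
      writer.setMaxBufferedDocs(mergeFactor * typicalDocsPerFlush - 1);

      for (Document doc : docs) {
        writer.addDocument(doc);
        if (writer.ramSizeInBytes() > RAM_BUDGET) {
          writer.flush();              // write the buffered docs as a new segment
        }
      }
      writer.close();
    }
  }

Once the merge policy infers levels from segment size in bytes, as the issue
suggests, this coupling between maxBufferedDocs and the flush size should no
longer matter.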