[ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509576 ]
Michael McCandless commented on LUCENE-856:
-------------------------------------------

I ran a new performance comparison here to test the merging cost of
autoCommit false vs true, this time using Wikipedia content and
contrib/benchmark. I indexed all of Wikipedia using the patch from
LUCENE-843 and the patch from LUCENE-947, once with autoCommit=true
and once with autoCommit=false.

I used this alg (and just changed autocommit=true to false for the
second test):

  max.field.length=2147483647
  compound=false
  analyzer=org.apache.lucene.analysis.SimpleAnalyzer
  directory=FSDirectory
  ram.flush.mb=32
  max.buffered=20000
  autocommit=true
  doc.stored=true
  doc.tokenized=true
  doc.term.vector=true
  doc.term.vector.offsets=true
  doc.term.vector.positions=true
  doc.add.log.step=500
  docs.dir=enwiki
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
  doc.maker.forever=false

  ResetSystemErase
  CreateIndex
  [{AddDoc}: *] : 4
  CloseIndex

  RepSumByPref AddDoc

Which means: use 4 threads to index all text from each of the 3.2
million Wikipedia docs, with stored fields & term vectors turned on,
using SimpleAnalyzer, flushing when RAM usage hits 32 MB. The index
size is 20 GB.

Report from autoCommit=true:

------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
Operation  round   runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem   avgTotalMem
AddDoc     0      3204066           1  226.3   14,159.22  282,843,296  373,480,960

Net elapsed time = 87 minutes 18 seconds

Report from autoCommit=false:

------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
Operation  round   runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem   avgTotalMem
AddDoc     0      3204066           1  407.6    7,860.63  252,046,000  329,962,048

Net elapsed time = 60 minutes 5 seconds

Some comments:

  * According to net elapsed time, autoCommit=false is 31% faster than
    autoCommit=true.
  * According to "rec/s" it's actually 44% faster; this is because
    rec/s measures only the actual addDocument time and not, e.g., the
    IO cost of retrieving the document contents.

  * The speedup is due entirely to the fact that the "doc stores"
    (vectors & stored fields) do not need to be merged when
    autoCommit=false. This is a major win because these files are
    enormous when you turn on stored fields & term vectors with
    offsets & positions.

  * The basic conclusion is the same as before: if you want to build
    up a large index, and it's not necessary to search this index
    while you are building it, the fastest way to do so is with the
    LUCENE-843 patch and autoCommit=false.

> Optimize segment merging
> ------------------------
>
>                 Key: LUCENE-856
>                 URL: https://issues.apache.org/jira/browse/LUCENE-856
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced, and now the time spent merging is a sizable
> portion of indexing time.
>
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 bytes of plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB. I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging. Index size is 86 GB.
>
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging. That's 65.6% of the time!
>
> Most of this time is presumably IO, which probably can't be reduced
> much unless we improve the overall merge policy and experiment with
> values for mergeFactor / buffer size.
>
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs.
> The IO system is RAID 0 of 4 drives, so these times are probably
> better than the more common case of a single hard drive, which would
> likely have slower IO.
>
> I think there are some simple things we could do to speed up merging:
>
>   * Experiment with buffer sizes -- maybe larger buffers for the
>     IndexInputs used during merging could help? At a default
>     mergeFactor of 10, the disk heads must do a lot of seeking back
>     and forth between these 10 files (and then to the 11th file where
>     we are writing).
>
>   * Use byte copying when possible, e.g. if there are no deletions on
>     a segment we can almost (I think?) just copy things like prox
>     postings, stored fields, and term vectors, instead of fully
>     parsing them into Java objects and then re-serializing them.
>
>   * Experiment with mergeFactor / different merge policies. For
>     example, I think LUCENE-854 would reduce time spent merging for a
>     given index size.
>
> This is currently just a place to list ideas for optimizing segment
> merges. I don't plan on working on this until after LUCENE-843.
>
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter. In the extreme case, if you open a writer, add 100 MM
> docs, then close the writer, no segment merges happen at all.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
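For reference, the autoCommit=false bulk-load pattern discussed above might look roughly like this with the Lucene 2.2-era IndexWriter API. This is a sketch, not code from the issue: setRAMBufferSizeMB comes from the LUCENE-843 patch (not yet in a released version at the time of this comment), and the index path is a placeholder.

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/index"); // placeholder path

        // Lucene 2.2-era constructor: (Directory, autoCommit, Analyzer).
        // With autoCommit=false the doc stores (stored fields + term
        // vectors) are shared across flushed segments and never merged,
        // which is where the speedup above comes from.
        IndexWriter writer = new IndexWriter(dir, false, new SimpleAnalyzer());
        writer.setRAMBufferSizeMB(32);    // flush at 32 MB, as in the .alg (LUCENE-843 patch)
        writer.setUseCompoundFile(false); // compound=false, as in the .alg

        Document doc = new Document();
        doc.add(new Field("body", "example text", Field.Store.YES,
                          Field.Index.TOKENIZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);

        // Nothing is visible to readers until close() commits the index.
        writer.close();
    }
}
```

The trade-off is the one named in the issue: no searcher can see the documents until the writer is closed, so this only suits pure build-then-search workloads.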
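As a sanity check, the two speedup figures quoted in the comment above follow directly from the reported numbers, reading "X% faster" as the fraction of time saved:

```java
// Derive the 31% and 44% speedup figures from the benchmark report above.
public class SpeedupCheck {
    public static void main(String[] args) {
        // Net elapsed time: 87 min 18 s (autoCommit=true) vs 60 min 5 s (false).
        double trueSec = 87 * 60 + 18;   // 5238 s
        double falseSec = 60 * 60 + 5;   // 3605 s
        long byElapsed = Math.round((trueSec - falseSec) / trueSec * 100);

        // rec/s counts only addDocument time (no doc-retrieval IO):
        // 226.3 rec/s (true) vs 407.6 rec/s (false). Time saved per doc
        // is 1 - (1/226.3)/(1/407.6) = 1 - 226.3/407.6.
        double recTrue = 226.3, recFalse = 407.6;
        long byRecs = Math.round((1 - recTrue / recFalse) * 100);

        System.out.println(byElapsed + "% faster by net elapsed time"); // prints 31
        System.out.println(byRecs + "% faster by rec/s");               // prints 44
    }
}
```

The gap between the two figures is the per-document IO of pulling Wikipedia content off disk, which rec/s excludes but net elapsed time includes.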