[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502793 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

I ran a benchmark using more than 1 thread to do indexing, in order to
test & compare concurrency of trunk and the patch. The test is the
same as above, and runs on a 4 core Mac Pro (OS X) box with a 4 drive
RAID 0 IO system.

Here are the raw results:

DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

NUM THREADS = 1

    new
      200000 docs in 172.3 secs
      index size = 1.7G

    old
      200000 docs in 539.5 secs
      index size = 1.7G

    Total Docs/sec:             old   370.7; new  1161.0 [  213.2% faster]
    Docs/MB @ flush:            old    47.9; new   334.6 [  598.7% more]
    Avg RAM used (MB) @ flush:  old   131.9; new    33.1 [   74.9% less]

NUM THREADS = 2

    new
      200001 docs in 130.8 secs
      index size = 1.7G

    old
      200001 docs in 452.8 secs
      index size = 1.7G

    Total Docs/sec:             old   441.7; new  1529.3 [  246.2% faster]
    Docs/MB @ flush:            old    47.9; new   301.5 [  529.7% more]
    Avg RAM used (MB) @ flush:  old   226.1; new    35.2 [   84.4% less]

NUM THREADS = 3

    new
      200002 docs in 105.4 secs
      index size = 1.7G

    old
      200002 docs in 428.4 secs
      index size = 1.7G

    Total Docs/sec:             old   466.8; new  1897.9 [  306.6% faster]
    Docs/MB @ flush:            old    47.9; new   277.8 [  480.2% more]
    Avg RAM used (MB) @ flush:  old   289.8; new    37.0 [   87.2% less]

NUM THREADS = 4

    new
      200003 docs in 104.8 secs
      index size = 1.7G

    old
      200003 docs in 440.4 secs
      index size = 1.7G

    Total Docs/sec:             old   454.1; new  1908.5 [  320.3% faster]
    Docs/MB @ flush:            old    47.9; new   259.9 [  442.9% more]
    Avg RAM used (MB) @ flush:  old   293.7; new    37.1 [   87.3% less]

NUM THREADS = 5

    new
      200004 docs in 99.5 secs
      index size = 1.7G

    old
      200004 docs in 425.0 secs
      index size = 1.7G

    Total Docs/sec:             old   470.6; new  2010.5 [  327.2% faster]
    Docs/MB @ flush:            old    47.9; new   245.3 [  412.6% more]
    Avg RAM used (MB) @ flush:  old   390.9; new    38.3 [   90.2% less]

NUM THREADS = 6

    new
      200005 docs in 106.3 secs
      index size = 1.7G

    old
      200005 docs in 427.1 secs
      index size = 1.7G

    Total Docs/sec:             old   468.2; new  1882.3 [  302.0% faster]
    Docs/MB @ flush:            old    47.8; new   248.5 [  419.3% more]
    Avg RAM used (MB) @ flush:  old   340.9; new    38.7 [   88.6% less]

NUM THREADS = 7

    new
      200006 docs in 106.1 secs
      index size = 1.7G

    old
      200006 docs in 435.2 secs
      index size = 1.7G

    Total Docs/sec:             old   459.6; new  1885.3 [  310.2% faster]
    Docs/MB @ flush:            old    47.8; new   248.7 [  420.0% more]
    Avg RAM used (MB) @ flush:  old   408.6; new    39.1 [   90.4% less]

NUM THREADS = 8

    new
      200007 docs in 109.0 secs
      index size = 1.7G

    old
      200007 docs in 469.2 secs
      index size = 1.7G

    Total Docs/sec:             old   426.3; new  1835.2 [  330.5% faster]
    Docs/MB @ flush:            old    47.8; new   251.3 [  425.5% more]
    Avg RAM used (MB) @ flush:  old   448.9; new    39.0 [   91.3% less]

Some quick comments:

  * Both trunk & the patch show speedups if you use more than 1 thread
    to do indexing. This is expected since the machine has
    concurrency (4 cores).

  * The biggest speedup is from 1->2 threads, but there are still good
    gains from 2->5 threads.

  * Best throughput seems to be at 5 threads.

  * The patch allows better concurrency: relatively speaking, it
    speeds up faster than the trunk as you increase # threads (the
    % faster grows from 213.2% at 1 thread to 330.5% at 8 threads).
    I think this makes sense because we flush less often with the
    patch, and flushing is time consuming and single threaded.
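
As a sanity check on the bracketed percentages in the results, here is a minimal Java sketch of how such figures can be derived from the old/new numbers. The class and method names (SpeedupCalc, percentFaster, percentLess) are hypothetical and are not part of the benchmark code or of Lucene. Because the inputs here are the rounded values printed in the results, recomputed figures can differ from the reported ones in the last digit.

```java
/** Hypothetical helper, not part of Lucene or the benchmark harness. */
class SpeedupCalc {

    /** Percent by which newRate exceeds oldRate, e.g. docs/sec. */
    static double percentFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    /** Percent reduction from oldValue to newValue, e.g. RAM used at flush. */
    static double percentLess(double oldValue, double newValue) {
        return (oldValue - newValue) / oldValue * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the NUM THREADS = 1 results.
        System.out.printf("Total Docs/sec: %.1f%% faster%n",
                percentFaster(370.7, 1161.0));
        System.out.printf("Avg RAM used @ flush: %.1f%% less%n",
                percentLess(131.9, 33.1));
    }
}
```

Note that "N% faster" here means the new rate is (1 + N/100) times the old rate, so 213.2% faster is roughly a 3.1x throughput ratio.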
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
>                      LUCENE-843.take3.patch, LUCENE-843.take4.patch,
>                      LUCENE-843.take5.patch, LUCENE-843.take6.patch,
>                      LUCENE-843.take7.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
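
The issue summary describes flushing according to RAM usage rather than after a fixed number of added documents. The following is a toy Java sketch of that idea only; it is not the patch's MultiDocumentWriter and not Lucene's IndexWriter. The class RamBudgetBuffer, its methods, and the use of UTF-8 byte length as a stand-in for RAM accounting are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer that flushes by estimated RAM use, not by doc count. */
class RamBudgetBuffer {
    private final long ramBudgetBytes;
    private final List<String> pending = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    RamBudgetBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    /** Buffers one document; flushes once the estimate reaches the budget. */
    void add(String doc) {
        pending.add(doc);
        // Assumption: UTF-8 length approximates the RAM cost of a document.
        bytesUsed += doc.getBytes(StandardCharsets.UTF_8).length;
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    /** Discards the buffered documents; a real writer would write a segment. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    int pendingDocs() {
        return pending.size();
    }
}
```

With a byte budget, many small documents share one flush while a few large documents trigger a flush sooner, which a fixed document count cannot express.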