[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

OK, I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.  I used the "normal" sized docs (~5,500 bytes plain
text), left stored fields and term vectors (positions + offsets) on,
and used autoCommit=false.  Here are the results:

  NUM THREADS = 1
  MERGE FACTOR = 10
  With term vectors (positions + offsets) and 2 small stored fields
  AUTOCOMMIT = false (commit only once at the end)

Every run indexed 200000 docs and produced a 1.7G index.

Indexing time and throughput:

  RAM buffer   old secs   new secs   old docs/sec   new docs/sec   speedup
   1 MB          862.2      297.1       232.0          673.2       190.2% faster
   2 MB          828.7      279.0       241.3          716.8       197.0% faster
   4 MB          840.5      260.8       237.9          767.0       222.3% faster
   8 MB          678.8      248.8       294.6          803.7       172.8% faster
  16 MB          660.6      247.3       302.8          808.7       167.1% faster
  24 MB          658.1      243.0       303.9          823.0       170.8% faster
  32 MB          714.2      239.2       280.0          836.0       198.5% faster
  48 MB          640.3      236.0       312.4          847.5       171.3% faster
  64 MB          649.3      238.3       308.0          839.3       172.5% faster
  80 MB          670.2      227.2       298.4          880.5       195.0% faster
  96 MB          683.4      226.8       292.7          882.0       201.4% faster

RAM efficiency at flush:

  RAM buffer   old docs/MB   new docs/MB   gain          old avg RAM (MB)   new avg RAM (MB)   reduction
   1 MB            47.2         278.4      489.6% more         34.5                 3.4        90.1% less
   2 MB            47.0         322.4      586.7% more         37.9                 4.5        88.0% less
   4 MB            46.8         363.1      675.4% more         33.9                 6.5        80.9% less
   8 MB            46.8         392.4      739.1% more         60.3                10.7        82.2% less
  16 MB            46.7         415.4      788.8% more         47.1                19.2        59.3% less
  24 MB            46.7         430.9      822.2% more         70.0                27.5        60.8% less
  32 MB            46.7         432.2      825.2% more         92.5                36.7        60.3% less
  48 MB            46.7         438.5      838.8% more        138.9                52.8        62.0% less
  64 MB            46.7         441.3      844.7% more        302.6                72.7        76.0% less
  80 MB            46.7         446.2      855.2% more        231.7                94.3        59.3% less
  96 MB            46.7         448.0      859.1% more        274.5               112.7        59.0% less
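For reference, the setup above corresponds roughly to a driver like the
following.  This is only a minimal sketch, not the actual benchmark code:
the field names, analyzer choice, and doc-body generator are placeholders,
and the RAM-buffer setter is only referenced in a comment since its exact
signature comes from this patch.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Minimal sketch of the indexing driver; not the actual benchmark code.
  public class IndexThroughput {
    public static void main(String[] args) throws Exception {
      final int numDocs = 200000;
      IndexWriter writer = new IndexWriter("/path/to/index",
                                           new StandardAnalyzer(), true);
      writer.setMergeFactor(10);
      // Old (trunk): flush by doc count, e.g. writer.setMaxBufferedDocs(...).
      // New (patch): flush by RAM usage via the new setter described in the
      // issue; exact name/units are per the patch, not shown here.
      long t0 = System.currentTimeMillis();
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();
        // 2 small stored fields
        doc.add(new Field("id", Integer.toString(i),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("date", "20070410",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // ~5,500 bytes of plain text, with term vectors (positions + offsets)
        doc.add(new Field("body", makeBody(i),
                          Field.Store.NO, Field.Index.TOKENIZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
      }
      writer.close();  // with autoCommit=false, the index is committed only here
      double secs = (System.currentTimeMillis() - t0) / 1000.0;
      System.out.println(numDocs + " docs in " + secs + " secs ("
                         + (long) (numDocs / secs) + " docs/sec)");
    }

    // Stand-in for the ~5,500-byte plain-text doc body used in the real test.
    private static String makeBody(int i) {
      StringBuffer sb = new StringBuffer(5500);
      while (sb.length() < 5500)
        sb.append("sample text for doc ").append(i).append(' ');
      return sb.toString();
    }
  }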
in the end but with "old" you don't. * Sweet spot for old (trunk) seems to be 48 MB: that is the peak docs/sec @ 312.4. * New (with patch) seems to just get faster the more memory you give it, though gradually. The peak was 96 MB (the largest I ran). So no sweet spot (or maybe I need to give more memory, but, above 96 MB the trunk was starting to swap on my test env). * New gets better and better RAM efficiency, the more RAM you give. This makes sense: it's better able to compress the terms dict, the more docs are merged in RAM before having to flush to disk. I would also expect this curve to be somewhat content dependent. > improve how IndexWriter uses RAM to buffer added documents > ---------------------------------------------------------- > > Key: LUCENE-843 > URL: https://issues.apache.org/jira/browse/LUCENE-843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.2 > Reporter: Michael McCandless > Assigned To: Michael McCandless > Priority: Minor > Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, > LUCENE-843.take3.patch, LUCENE-843.take4.patch > > > I'm working on a new class (MultiDocumentWriter) that writes more than > one document directly into a single Lucene segment, more efficiently > than the current approach. > This only affects the creation of an initial segment from added > documents. I haven't changed anything after that, eg how segments are > merged. > The basic ideas are: > * Write stored fields and term vectors directly to disk (don't > use up RAM for these). > * Gather posting lists & term infos in RAM, but periodically do > in-RAM merges. Once RAM is full, flush buffers to disk (and > merge them later when it's time to make a real segment). > * Recycle objects/buffers to reduce time/stress in GC. > * Other various optimizations. > Some of these changes are similar to how KinoSearch builds a segment. > But, I haven't made any changes to Lucene's file format nor added > requirements for a global fields schema. > So far the only externally visible change is a new method > "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is > deprecated) so that it flushes according to RAM usage and not a fixed > number documents added. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]