[ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499018 ]
Michael McCandless commented on LUCENE-888:
-------------------------------------------

OK, I ran two sets of tests. The first was only on Mac OS X, to see how performance changes with buffer size. The second also covered Debian Linux & Windows XP Pro. Overall, indexing is 10-18% faster.

FIRST TEST

I increased the buffer size, separately, for each of BufferedIndexInput, BufferedIndexOutput and CompoundFileWriter. Each test was run once on Mac OS X:

  BufferedIndexInput
      1 K   622 sec (current trunk)
      4 K   607 sec
      8 K   606 sec
     16 K   598 sec
     32 K   606 sec
     64 K   589 sec
    128 K   601 sec

  CompoundFileWriter
      1 K   622 sec (current trunk)
      4 K   599 sec
      8 K   591 sec
     16 K   578 sec
     32 K   583 sec
     64 K   580 sec

  BufferedIndexOutput
      1 K   622 sec (current trunk)
      4 K   588 sec
      8 K   576 sec
     16 K   551 sec
     32 K   566 sec
     64 K   555 sec
    128 K   543 sec
    256 K   534 sec
    512 K   564 sec

Comments:

  * The results are fairly noisy, but performance does generally get better w/ larger buffers.

  * BufferedIndexOutput seems specifically to like very large output buffers; the other two show a smaller but still significant effect.

Given this I picked a 16 K buffer for BufferedIndexOutput, a 16 K buffer for CompoundFileWriter and a 4 K buffer for BufferedIndexInput. I think a larger buffer for BufferedIndexInput would be faster still, but even when merging quite a few of these are created (mergeFactor * N, where N = number of separate index files).

Then I re-tested the baseline (trunk) & these buffer sizes across platforms (below):

SECOND TEST

  Baseline (trunk) = 1 K buffers for all 3.

  New = 16 K for BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for BufferedIndexInput.
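To make the mergeFactor * N concern concrete, here is a rough sketch of how the total memory held by input buffers scales with buffer size. The class and method names are hypothetical (they are not Lucene APIs), and the value N = 8 files per segment is an assumption chosen only for illustration; mergeFactor = 10 matches the test setup.

```java
// Hypothetical helper, not part of Lucene: estimates the heap held by
// read buffers while a merge is running.
final class InputBufferFootprint {

    /**
     * @param mergeFactor     number of segments merged at once
     * @param filesPerSegment N, the number of separate index files per segment (assumed)
     * @param bufferBytes     size of each input buffer in bytes
     * @return total bytes held by mergeFactor * N input buffers
     */
    static long bytes(int mergeFactor, int filesPerSegment, int bufferBytes) {
        // One buffer per open input, and mergeFactor * N inputs in total.
        return (long) mergeFactor * filesPerSegment * bufferBytes;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;     // value used in the test setup
        int filesPerSegment = 8;  // assumption, for illustration only
        for (int kb : new int[] {1, 4, 16, 64}) {
            long total = bytes(mergeFactor, filesPerSegment, kb * 1024);
            System.out.println(kb + " K buffers -> " + (total / 1024) + " KB total");
        }
    }
}
```

Under these assumptions the footprint grows linearly with buffer size, which is one way to read the choice of a smaller 4 K buffer for the more numerous inputs.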
I ran each test 4 times & took the best time:

  Quad core Mac OS X on 4-drive RAID 0
    baseline   622 sec
    new        527 sec -> 15% faster

  Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
    baseline   708 sec
    new        635 sec -> 10% faster

  Windows XP Pro laptop, single drive
    baseline  1604 sec
    new       1308 sec -> 18% faster

Net/net it's a 10-18% performance gain overall. It is interesting that the system with the "weakest" IO system (one drive on Windows XP vs RAID 0/5 on the others) has the best gains.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput). Second is CompoundFileWriter's buffer used to
> actually build the compound file. Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843. I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English). I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields. I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845. The resulting index is 1.7 GB. The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now. During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there. And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.
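
As a quick check of the percentages quoted in this thread, the sketch below recomputes each gain as the reduction in wall-clock time relative to the baseline. The class and method names are hypothetical, and the formula is my reading of "% faster" here; the timings are the ones reported above.

```java
// Hypothetical helper for checking the reported gains; not part of Lucene.
final class IndexingSpeedup {

    /** Percent reduction in elapsed time relative to the baseline. */
    static double percentFaster(double baselineSec, double newSec) {
        return (baselineSec - newSec) / baselineSec * 100.0;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Mac OS X, 4-drive RAID 0",
            "Debian Linux, 6 drive RAID 5",
            "Windows XP Pro, single drive",
            "Mac OS X, both buffers at 8 KB"
        };
        double[][] timings = {
            {622, 527},
            {708, 635},
            {1604, 1308},
            {622, 554}
        };
        for (int i = 0; i < labels.length; i++) {
            double gain = percentFaster(timings[i][0], timings[i][1]);
            System.out.println(labels[i] + ": " + Math.round(gain) + "% faster");
        }
    }
}
```

Rounded to whole percentages, this reproduces the 15%, 10%, 18% and 11% figures given in the comment and in the issue description.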