Improve indexing performance by increasing internal buffer sizes ----------------------------------------------------------------
Key: LUCENE-888 URL: https://issues.apache.org/jira/browse/LUCENE-888 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.1 Reporter: Michael McCandless Assigned To: Michael McCandless Priority: Minor In working on LUCENE-843, I noticed that two buffer sizes have a substantial impact on overall indexing performance. First is BufferedIndexOutput.BUFFER_SIZE (also used by BufferedIndexInput). Second is CompoundFileWriter's buffer used to actually build the compound file. Both are now 1 KB (1024 bytes). I ran the same indexing test I'm using for LUCENE-843. I'm indexing ~5,500 byte plain text docs derived from the Europarl corpus (English). I index 200,000 docs with compound file enabled and term vector positions & offsets stored plus stored fields. I flush documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to not hit LUCENE-845. The resulting index is 1.7 GB. The index is not optimized in the end and I left mergeFactor @ 10. I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO system. At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if I increase both buffers to 8 KB it takes 554 sec to build the index, which is an 11% overall gain! I will run more tests to see if there is a natural knee in the curve (buffer size above which we don't really gain much more performance). I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE at 1024, at least for now. During searching there can be quite a few of this class instantiated, and likely a larger buffer size for the freq/prox streams could actually hurt search performance for those searches that use skipping. The CompoundFileWriter buffer is created only briefly, so I think we can use a fairly large (32 KB?) buffer there. And there should not be too many BufferedIndexOutputs alive at once so I think a large-ish buffer (16 KB?) should be OK. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]