[ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499020 ]
Michael McCandless commented on LUCENE-888:
-------------------------------------------
> I would like to know why these gains are appearing, and how specific
> they are to a particular system. How can the optimum buffer size be
> deduced? Is it a factor of hard disk sector size? Memory page size?
> Lucene write behavior pattern? Level X Cache size?
The gains appear to be cross-platform (at least across OS X, Linux,
and Windows XP) and to hold across IO architectures.
I'm not sure how this depends on, or correlates with, the various
cache/page sizes through the layers from the OS down to the disk heads.
It must be that doing an IO request has a fairly high overhead and so
the more bytes you can read/write at once the faster it is, since you
amortize that overhead.
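To make the amortization argument concrete, here is a minimal sketch of a cost model in Java. It is not Lucene code: the class name, the per-request overhead, and the bandwidth figure are all assumptions chosen purely for illustration, not measurements.

```java
// Hypothetical cost model (not part of Lucene): every IO request pays a
// fixed overhead, so fewer and larger requests spread that cost out.
class AmortizedIoCost {

    /** Number of IO requests needed to move totalBytes (ceiling division). */
    static long requests(long totalBytes, int bufferSize) {
        return (totalBytes + bufferSize - 1) / bufferSize;
    }

    /**
     * Estimated seconds: fixed overhead per request plus raw transfer time.
     * Both rates are assumed inputs, not measured values.
     */
    static double estimatedSeconds(long totalBytes, int bufferSize,
                                   double overheadSecPerRequest,
                                   double bytesPerSec) {
        return requests(totalBytes, bufferSize) * overheadSecPerRequest
                + totalBytes / bytesPerSec;
    }

    public static void main(String[] args) {
        long totalBytes = 1L << 30;   // 1 GiB of IO, illustrative only
        double overhead = 5e-6;       // ASSUMED: 5 microseconds per request
        double bandwidth = 100e6;     // ASSUMED: 100 MB/s sustained transfer
        for (int size : new int[] {1024, 8192, 16384, 32768}) {
            System.out.printf("%6d-byte buffer: %8d requests, ~%.2f s%n",
                    size, requests(totalBytes, size),
                    estimatedSeconds(totalBytes, size, overhead, bandwidth));
        }
    }
}
```

Under this model the transfer term stays constant while the overhead term shrinks in proportion to the buffer size, so the benefit of each further doubling diminishes. That is one way a knee in the curve could arise, though the real behavior depends on the OS and hardware.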
For merging in particular, with mergeFactor=10, I can see that a
larger buffer size on the input streams should help reduce the heavy
seeking back & forth between the 10 input files (and the 1 output file).
Maybe larger reads on the input streams also cause the OS's IO
scheduler to do larger read-ahead in anticipation?
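A rough way to see the merge effect is to count how often each input buffer must be refilled while the inputs are consumed in interleaved fashion. The simulation below is a sketch under simplifying assumptions and is not how Lucene's merge is implemented: the class name, round-robin consumption, fixed record size, and equal-length inputs are all hypothetical, and each refill is treated as one potential seek because the previous physical read was most likely against a different file.

```java
// Hypothetical simulation (not Lucene's merge code): count buffer refills
// when several input streams are consumed round-robin during a merge.
class MergeSeekSketch {

    /**
     * Consumes `streams` inputs of `bytesPerStream` each, `recordBytes` at a
     * time in round-robin order, and returns the total number of buffer
     * refills. Simplification: leftover buffered bytes are kept when a
     * refill happens, so a record may span two refills.
     */
    static long countRefills(int streams, long bytesPerStream,
                             int bufferSize, int recordBytes) {
        long[] remainingOnDisk = new long[streams];
        long[] buffered = new long[streams];
        java.util.Arrays.fill(remainingOnDisk, bytesPerStream);
        long refills = 0;
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (int i = 0; i < streams; i++) {
                long need = Math.min(recordBytes, buffered[i] + remainingOnDisk[i]);
                if (need == 0) {
                    continue; // this stream is exhausted
                }
                while (buffered[i] < need) {
                    // Refill: one physical read of up to bufferSize bytes.
                    long chunk = Math.min(bufferSize, remainingOnDisk[i]);
                    remainingOnDisk[i] -= chunk;
                    buffered[i] += chunk;
                    refills++;
                }
                buffered[i] -= need;
                progressed = true;
            }
        }
        return refills;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;          // matches the mergeFactor discussed above
        long bytesPerStream = 1L << 20; // ASSUMED: 1 MiB per input segment file
        int recordBytes = 100;          // ASSUMED: bytes consumed per merge step
        for (int size : new int[] {1024, 8192, 16384}) {
            System.out.printf("%6d-byte buffers: %d refills%n", size,
                    countRefills(mergeFactor, bytesPerStream, size, recordBytes));
        }
    }
}
```

With these assumed inputs, going from 1 KB to 8 KB buffers cuts the refill count by a factor of eight. Whether each refill really costs a physical seek depends on OS caching and read-ahead, which this sketch does not model.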
And some good news: these gains seem to be additive to the gains in
LUCENE-843, at least with my initial testing.
> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
> Key: LUCENE-888
> URL: https://issues.apache.org/jira/browse/LUCENE-888
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput). Second is CompoundFileWriter's buffer used to
> actually build the compound file. Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843. I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English). I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields. I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845. The resulting index is 1.7 GB. The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now. During searching there can be quite a few
> instances of this class alive, and a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there. And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.