[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

John Haxby (JIRA) Fri, 25 May 2007 04:25:39 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499044 ]


John Haxby commented on LUCENE-888:
-----------------------------------

> Net/net it's between 10-18% performance gain overall. It is
> interesting that the system with the "weakest" IO system (one drive on
> Windows XP vs RAID 0/5 on the others) has the best gains.

Actually, it's not that surprising.  Linux and BSD (MacOS) kernels work hard to
do good I/O without the user having to do much to account for it.  The
improvement you're seeing on those systems has as much to do with the fact
that you're now dealing in complete file system blocks (4x4k) and complete VM
pages (4x4k) as with anything else.  You'd probably see similar gains just
going from 1k to 4k, though: even "cp" benefits from using a 4k block size
rather than 1k.  I'd guess that a 4k or 8k buffer would be best on Linux/MacOS
and that you wouldn't see much difference going to 16k.  In fact, in the MacOS
tests the big jump seems to be from 1k to 4k, with smaller improvements
thereafter.
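
To make the block-size point concrete, here is a minimal sketch that counts how
many writes actually reach the underlying stream for a few buffer sizes.  It
uses the JDK's java.io.BufferedOutputStream as a stand-in, not Lucene's
BufferedIndexOutput, and the FlushCountDemo and CountingSink names are made up
for illustration.  It only counts calls; it says nothing about how a given
kernel or file system handles them.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo: counts writes that reach the "device" for a given buffer size. */
class FlushCountDemo {

    /** A sink that discards data but records every write call it receives. */
    static final class CountingSink extends OutputStream {
        long calls = 0;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
        }
    }

    /** Writes totalBytes one byte at a time through a buffer of bufferSize bytes. */
    static long underlyingWrites(int bufferSize, int totalBytes) {
        CountingSink sink = new CountingSink();
        try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int i = 0; i < totalBytes; i++) {
                out.write(i & 0xFF);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream flushes the last partial or full buffer.
        return sink.calls;
    }

    public static void main(String[] args) {
        int total = 64 * 1024;
        for (int kb : new int[] {1, 4, 8, 16}) {
            System.out.println(kb + "k buffer: "
                + underlyingWrites(kb * 1024, total) + " underlying writes");
        }
    }
}
```

Each underlying write at 4k or above hands the OS at least one whole 4k block
or page, whereas a 1k buffer needs four writes to cover the same block.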

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is 
usually dire and anything that will cut down on the I/O is going to be a big 
help.  I would expect that the difference would be more dramatic with a FAT32 
file system than it would be with NTFS though.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
>
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
>
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
>
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
>
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
>
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
>
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
>
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.
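
One rough way to look for the "knee in the curve" mentioned in the description
is to time the same sequential write at several buffer sizes.  The sketch below
is an assumption-laden stand-in: it uses java.io.BufferedOutputStream and a
temp file rather than Lucene's own output classes, and BufferSizeSweep and
timeWrite are hypothetical names.  Results depend heavily on the OS cache, file
system, and hardware, so treat the numbers as indicative only.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical harness: times a sequential write at several buffer sizes. */
class BufferSizeSweep {

    /** Writes totalBytes to a temp file through a buffer of bufferSize bytes; returns elapsed ns. */
    static long timeWrite(int bufferSize, long totalBytes) {
        try {
            Path tmp = Files.createTempFile("bufsweep", ".bin");
            try {
                long start = System.nanoTime();
                try (OutputStream out =
                         new BufferedOutputStream(Files.newOutputStream(tmp), bufferSize)) {
                    for (long i = 0; i < totalBytes; i++) {
                        out.write((int) (i & 0xFF));
                    }
                }
                long elapsed = System.nanoTime() - start;
                // Sanity check: everything written must have reached the file.
                if (Files.size(tmp) != totalBytes) {
                    throw new IOException("short write");
                }
                return elapsed;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long total = 64L * 1024 * 1024; // 64 MB per run; an arbitrary choice
        for (int kb : new int[] {1, 4, 8, 16, 32}) {
            long ns = timeWrite(kb * 1024, total);
            System.out.printf("%2d KB buffer: %d ms%n", kb, ns / 1_000_000);
        }
    }
}
```

A real test would repeat each size several times and use an actual indexing
workload, as in the 622 sec vs 554 sec comparison above.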

