Re: Lucene optimization with one large index and numerous small indexes.

2004-03-30 Thread Doug Cutting
Esmond Pitt wrote:
> Don't want to start a buffer size war, but these have always seemed too
> small to me. I'd recommend upping both InputStream and OutputStream buffer
> sizes to at least 4k, as this is the cluster size on most disks these days,
> and also a common VM page size.

Okay.

> Reading and writing in smaller quantities
> than these is definitely suboptimal.
This is not obvious to me.  Can you provide Lucene benchmarks which show 
this?  Modern filesystems have extensive caches, perform read-ahead and 
delay writes.  Thus file-based system calls do not have a close 
correspondence to physical operations.

To my thinking, the primary role of file buffering in Lucene is to 
minimize the overhead of the system call itself, not to minimize 
physical i/o operations.  Once the overhead of the system call is made 
insignificant, larger buffers offer little measurable improvement.
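If someone wants to check this, a crude, Lucene-independent test is to time
sequential reads of one big index file through java.io buffers of various
sizes.  This is only a sketch: args[0] is whatever large file you have handy,
and with a warm OS cache the numbers mostly reflect system-call overhead
rather than physical i/o, which is exactly the effect in question.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Crude benchmark: sequentially read one large file through buffers of
// different sizes and print the elapsed time for each.
public class BufferSizeBench {
  public static void main(String[] args) throws IOException {
    String file = args[0];                            // e.g. a large segment file
    int[] sizes = { 1024, 4096, 65536, 1024 * 1024 };
    byte[] chunk = new byte[256];                     // small reads, as a term scan would issue
    for (int i = 0; i < sizes.length; i++) {
      long start = System.currentTimeMillis();
      InputStream in = new BufferedInputStream(new FileInputStream(file), sizes[i]);
      while (in.read(chunk) != -1) { /* discard */ }
      in.close();
      System.out.println(sizes[i] + " bytes: "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }
}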

Also, we cannot increase the size of these blindly.  Buffers are the 
largest source of per-query memory allocation in Lucene, with one (or 
two for phrases and spans) allocated for every query term.  Folks whose 
applications perform wildcard queries have encountered out-of-memory 
exceptions with the current buffer size.
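To put rough, hypothetical numbers on it: a wildcard that expands to 50,000
terms allocates 50,000 buffers, which is on the order of 50MB at a 1KB buffer
size, but around 50GB at 1MB per buffer, far more than any reasonable heap.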

Possibly one could implement a term wildcard mechanism which does not 
require a buffer per term, or perhaps one could allocate small buffers 
for infrequent terms (the vast majority).  If such changes were made 
then it might be feasible to bump up the buffer size somewhat.  But, 
back to my first point, one must first show that larger buffers offer 
significant performance improvements.

Doug



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote:
> We're using Lucene with one large target index which right now is 5G.
> Every night we take sub-indexes which are about 500M and merge them
> into this main index.  This merge (done via
> IndexWriter.addIndexes(Directory[])) is taking way too much time.
>
> Looking at the stats for the box we're essentially blocked on reads.
> The disk is blocked on read IO and CPU is at 5%.  If I'm right I think
> this could be minimized by repeatedly picking the two smallest indexes,
> merging them, then picking the next two smallest, and so on until we're
> down to one index.
>
> Does this sound about right?
I don't think this will make things much faster.  You'll do somewhat 
fewer seeks, but you'll have to make log(N) passes over all of the data, 
about three or four in your case.  Merging ten indexes in a single pass 
should be fastest, as all of the data is only processed once, but the 
read-ahead on each file needs to be sufficient so that i/o is not 
dominated by seeks.  Can you use iostat or somesuch to find how many 
seeks/second you're seeing on the device?  Also, what's the average 
transfer rate?  Is it anywhere near the disk's capacity?  Finally, if 
possible, write the merged index to a different drive.  Reading the 
inputs from different drives may help as well.

One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you spend 
more time doing transfers with less wasted on seeks.  If that helps, 
then perhaps we ought to make this settable via system property or somesuch.
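To make that concrete, a single-pass merge along these lines might look
roughly like the following.  This is only a sketch: the paths, the analyzer
and the two-drive layout are placeholders, and since InputStream.BUFFER_SIZE
is a compile-time constant the larger-buffer experiment currently means
rebuilding Lucene with a different value.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical nightly merge: read all sub-indexes in one pass and write the
// merged index on a second drive so reads and writes don't compete for seeks.
public class NightlyMerge {
  public static void main(String[] args) throws Exception {
    Directory[] subIndexes = new Directory[] {
      FSDirectory.getDirectory("/disk1/indexes/sub0", false),
      FSDirectory.getDirectory("/disk1/indexes/sub1", false),
      // ... the rest of the ~500M nightly sub-indexes
    };
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/disk2/indexes/main", false),  // different drive
        new StandardAnalyzer(), false);                           // false = append to existing index
    writer.addIndexes(subIndexes);  // single pass over all inputs; optimizes when done
    writer.close();
  }
}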

Cheers,

Doug



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:
> One way to force larger read-aheads might be to pump up Lucene's input
> buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE
> to 1024*1024 or larger.  You'll want to do this just for the merge
> process and not for searching and indexing.  That should help you
> spend more time doing transfers with less wasted on seeks.  If that
> helps, then perhaps we ought to make this settable via system property
> or somesuch.

Good suggestion... seems about 10% - 15% faster in a few strawman
benchmarks I ran.

Note that right now this var is final and not public... so that will
probably need to change.  Does it make sense to also increase
OutputStream.BUFFER_SIZE?  This would seem to make sense, since an
optimize involves a large number of reads and writes.
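For now the experiment is just a local edit and rebuild, something like the
sketch below (the stock values are the small defaults discussed above; exact
numbers may differ by version):

  // org.apache.lucene.store.InputStream -- and, if it helps, OutputStream too
  static final int BUFFER_SIZE = 1024 * 1024;  // merge-only experiment; stock value is much smaller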

I'm obviously willing to throw memory at the problem.



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote:
>> One way to force larger read-aheads might be to pump up Lucene's input
>> buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE
>> to 1024*1024 or larger.  You'll want to do this just for the merge
>> process and not for searching and indexing.  That should help you
>> spend more time doing transfers with less wasted on seeks.  If that
>> helps, then perhaps we ought to make this settable via system property
>> or somesuch.
>
> Good suggestion... seems about 10% - 15% faster in a few strawman
> benchmarks I ran.
How long is it taking to merge your 5GB index?  Do you have any stats 
about disk utilization during merge (seeks/second, bytes 
transferred/second)?  Did you try buffer sizes even larger than 1MB? 
Are you writing to a different disk, as suggested?

> Note that right now this var is final and not public... so that will
> probably need to change.
Perhaps.  I'm reluctant to make it too easy to change this.  People tend 
to randomly tweak every available knob and then report bugs, or, if it 
doesn't crash, start recommending that everyone else tweak the knob as 
they do.  There are lots of tradeoffs with buffer size, cases that folks 
might not think of (like that a wildcard query creates a buffer for 
every term that matches), etc.

> Does it make sense to also increase OutputStream.BUFFER_SIZE?  This would
> seem to make sense, since an optimize involves a large number of reads
> and writes.
It might help a little if you're merging to the same disk as you're 
reading from, but probably not a lot.  If you're merging to a different 
disk then it shouldn't make much difference at all.

Doug



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:
> How long is it taking to merge your 5GB index?  Do you have any stats
> about disk utilization during merge (seeks/second, bytes
> transferred/second)?  Did you try buffer sizes even larger than 1MB?
> Are you writing to a different disk, as suggested?

I'll do some more testing tonight and get back to you.

>> Note that right now this var is final and not public... so that will
>> probably need to change.
>
> Perhaps.  I'm reluctant to make it too easy to change this.  People
> tend to randomly tweak every available knob and then report bugs, or,
> if it doesn't crash, start recommending that everyone else tweak the
> knob as they do.  There are lots of tradeoffs with buffer size, cases
> that folks might not think of (like that a wildcard query creates a
> buffer for every term that matches), etc.
Or you can do what I do and recompile ;) 

>> Does it make sense to also increase OutputStream.BUFFER_SIZE?  This
>> would seem to make sense, since an optimize involves a large number of
>> reads and writes.
>
> It might help a little if you're merging to the same disk as you're
> reading from, but probably not a lot.  If you're merging to a
> different disk then it shouldn't make much difference at all.

Right now we are merging to the same disk...  I'll perform some real
benchmarks with this var too.  Long term we're going to migrate to using
two SCSI disks per machine and then doing parallel queries across them
with optimized indexes.

Also, with modern disk controllers and filesystems I'm not sure how much
difference this should make.  Both ReiserFS and XFS do a lot of internal
buffering, as does our disk controller.  I guess I'll find out...
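Roughly what I have in mind for that setup, as a sketch only: the paths and
field name are placeholders, and the stock MultiSearcher queries the two
indexes one after the other, so real per-disk parallelism would still need
our own threading on top.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

// Hypothetical two-disk layout: one optimized index per SCSI disk, queried
// as a single logical index.
public class TwoDiskSearch {
  public static void main(String[] args) throws Exception {
    Searchable[] shards = new Searchable[] {
      new IndexSearcher("/scsi0/index"),   // placeholder paths
      new IndexSearcher("/scsi1/index"),
    };
    MultiSearcher searcher = new MultiSearcher(shards);
    Query query = QueryParser.parse(args[0], "contents", new StandardAnalyzer());
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " total hits");
    searcher.close();
  }
}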

Kevin



Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Esmond Pitt
Don't want to start a buffer size war, but these have always seemed too
small to me. I'd recommend upping both InputStream and OutputStream buffer
sizes to at least 4k, as this is the cluster size on most disks these days,
and also a common VM page size. Reading and writing in smaller quantities
than these is definitely suboptimal.

Esmond Pitt




Lucene optimization with one large index and numerous small indexes.

2004-03-28 Thread Kevin A. Burton
We're using Lucene with one large target index which right now is 5G.
Every night we take sub-indexes which are about 500M and merge them
into this main index.  This merge (done via
IndexWriter.addIndexes(Directory[])) is taking way too much time.

Looking at the stats for the box we're essentially blocked on reads.
The disk is blocked on read IO and CPU is at 5%.  If I'm right I think
this could be minimized by repeatedly picking the two smallest indexes,
merging them, then picking the next two smallest, and so on until we're
down to one index.

Does this sound about right?
