Re: Lucene optimization with one large index and numerous small indexes.
Esmond Pitt wrote: Don't want to start a buffer size war, but these have always seemed too small to me. I'd recommend upping both InputStream and OutputStream buffer sizes to at least 4k, as this is the cluster size on most disks these days, and also a common VM page size.

Okay.

Reading and writing in smaller quantities than these is definitely suboptimal.

This is not obvious to me. Can you provide Lucene benchmarks which show this? Modern filesystems have extensive caches, perform read-ahead, and delay writes, so file-based system calls do not correspond closely to physical operations. To my thinking, the primary role of file buffering in Lucene is to minimize the overhead of the system call itself, not to minimize physical i/o operations. Once the overhead of the system call is made insignificant, larger buffers offer little measurable improvement.

Also, we cannot increase the size of these blindly. Buffers are the largest source of per-query memory allocation in Lucene, with one (or two, for phrases and spans) allocated for every query term. Folks whose applications perform wildcard queries have encountered out-of-memory exceptions with the current buffer size. Possibly one could implement a term wildcard mechanism which does not require a buffer per term, or perhaps one could allocate small buffers for infrequent terms (the vast majority). If such changes were made, then it might be feasible to bump up the buffer size somewhat. But, back to my first point, one must first show that larger buffers offer significant performance improvements.

Doug
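To put the memory concern above in numbers, here is a back-of-the-envelope sketch; the term count is an illustrative assumption, and the 1k figure is simply the small per-term buffer size under discussion.

// Back-of-the-envelope estimate of per-query buffer memory for a broad
// wildcard query. The figures are illustrative assumptions, not measurements.
public class WildcardBufferEstimate {
    public static void main(String[] args) {
        int expandedTerms = 50000;  // terms a broad wildcard might match (assumed)
        int currentBuffer = 1024;   // the small per-term buffer under discussion
        int proposedBuffer = 4096;  // Esmond's suggested 4k minimum

        // One buffer per matching term (two for phrases and spans, per Doug).
        long current = (long) expandedTerms * currentBuffer;
        long proposed = (long) expandedTerms * proposedBuffer;

        System.out.println("1k buffers: " + current / (1024 * 1024) + " MB");
        System.out.println("4k buffers: " + proposed / (1024 * 1024) + " MB");
    }
}

Under these assumptions the same wildcard query goes from roughly 49 MB of buffers to roughly 195 MB, which is why raising the default blindly risks exactly the out-of-memory failures described above.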
Re: Lucene optimization with one large index and numerous small indexes.
Kevin A. Burton wrote: We're using Lucene with one large target index which right now is 5G. Every night we take sub-indexes which are about 500M and merge them into this main index. This merge (done via IndexWriter.addIndexes(Directory[])) is taking way too much time. Looking at the stats for the box we're essentially blocked on reads: the disk is blocked on read IO and the CPU is at 5%. If I'm right, I think this could be minimized by continually picking the two smallest indexes, merging them, then picking the next two smallest, merging them, and so on until we're down to one index. Does this sound about right?

I don't think this will make things much faster. You'll do somewhat fewer seeks, but you'll have to make log(N) passes over all of the data, about three or four in your case. Merging ten indexes in a single pass should be fastest, as all of the data is processed only once, but the read-ahead on each file needs to be sufficient so that i/o is not dominated by seeks.

Can you use iostat or somesuch to find how many seeks/second you're seeing on the device? Also, what's the average transfer rate? Is it anywhere near the disk's capacity? Finally, if possible, write the merged index to a different drive. Reading the inputs from different drives may help as well.

One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help you spend more time doing transfers with less wasted on seeks. If that helps, then perhaps we ought to make this settable via system property or somesuch.

Cheers, Doug
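A single-pass merge along these lines might look like the following sketch, using the Lucene 1.x-era IndexWriter.addIndexes(Directory[]) API named above; the command-line layout and the choice of StandardAnalyzer are assumptions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge all sub-indexes into the main index in one pass, so every input
// is read exactly once and the output is written exactly once.
public class OnePassMerge {
    public static void main(String[] args) throws Exception {
        // args[0] is the main index; the remaining args are sub-indexes.
        Directory[] subs = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            subs[i - 1] = FSDirectory.getDirectory(args[i], false); // open existing
        }
        // false = open the existing main index rather than create a new one
        IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), false);
        writer.addIndexes(subs); // one pass over all inputs
        writer.close();
    }
}

Putting the main index on a drive different from the sub-indexes implements the write-to-a-different-drive advice with no code change.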
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help you spend more time doing transfers with less wasted on seeks. If that helps, then perhaps we ought to make this settable via system property or somesuch.

Good suggestion... it seems about 10-15% faster in a few strawman benchmarks I ran. Note that right now this var is final and not public, so that will probably need to change.

Does it make sense to also increase OutputStream.BUFFER_SIZE? That would seem to make sense, since an optimize is a large number of reads and writes. I'm obviously willing to throw memory at the problem.

Kevin
Re: Lucene optimization with one large index and numerous small indexes.
Kevin A. Burton wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help you spend more time doing transfers with less wasted on seeks. If that helps, then perhaps we ought to make this settable via system property or somesuch. Good suggestion... seems about 10-15% faster in a few strawman benchmarks I ran.

How long is it taking to merge your 5GB index? Do you have any stats about disk utilization during the merge (seeks/second, bytes transferred/second)? Did you try buffer sizes even larger than 1MB? Are you writing to a different disk, as suggested?

Note that right now this var is final and not public... so that will probably need to change.

Perhaps. I'm reluctant to make it too easy to change this. People tend to randomly tweak every available knob and then report bugs, or, if it doesn't crash, start recommending that everyone else tweak the knob as they do. There are lots of tradeoffs with buffer size, and cases that folks might not think of (like that a wildcard query creates a buffer for every term that matches), etc.

Does it make sense to also increase OutputStream.BUFFER_SIZE? This would seem to make sense since an optimize is a large number of reads and writes.

It might help a little if you're merging to the same disk as you're reading from, but probably not a lot. If you're merging to a different disk then it shouldn't make much difference at all.

Doug
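If the constant were ever made settable, a sketch of the system-property route mentioned above might look like this; the property name is invented for illustration and is not an actual Lucene setting.

// Hypothetical sketch of a system-property override; the property name
// is invented for illustration, not a real Lucene setting.
public final class BufferSize {
    // Defaults to 1k; a dedicated merge process could be launched with
    // java -Dlucene.merge.bufferSize=1048576 ... to get 1MB buffers
    // without touching the search or indexing JVMs.
    public static final int INPUT_BUFFER_SIZE =
        Integer.getInteger("lucene.merge.bufferSize", 1024).intValue();

    private BufferSize() {}
}

Scoping the override to a separate merge process sidesteps the wildcard-query memory hazard, since the searching JVMs keep the small default.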
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: How long is it taking to merge your 5GB index? Do you have any stats about disk utilization during the merge (seeks/second, bytes transferred/second)? Did you try buffer sizes even larger than 1MB? Are you writing to a different disk, as suggested?

I'll do some more testing tonight and get back to you.

Note that right now this var is final and not public... so that will probably need to change.

Perhaps. I'm reluctant to make it too easy to change this. People tend to randomly tweak every available knob and then report bugs, or, if it doesn't crash, start recommending that everyone else tweak the knob as they do. There are lots of tradeoffs with buffer size, and cases that folks might not think of (like that a wildcard query creates a buffer for every term that matches), etc.

Or you can do what I do and recompile ;)

Does it make sense to also increase OutputStream.BUFFER_SIZE? This would seem to make sense since an optimize is a large number of reads and writes.

It might help a little if you're merging to the same disk as you're reading from, but probably not a lot. If you're merging to a different disk then it shouldn't make much difference at all.

Right now we are merging to the same disk... I'll perform some real benchmarks with this var too. Long term we're going to migrate to using two SCSI disks per machine and then doing parallel queries across them with optimized indexes. Also, with modern disk controllers and filesystems I'm not sure how much difference this should make: both Reiser and XFS do a lot of internal buffering, as does our disk controller. I guess I'll find out...

Kevin
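The two-disk plan could be wired up roughly as in the sketch below: one IndexSearcher per disk combined through a MultiSearcher. The mount points and the "contents" field name are assumptions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

// Query two optimized indexes, one per physical disk, as a single
// logical index. Paths and the "contents" field are assumptions.
public class TwoDiskSearch {
    public static void main(String[] args) throws Exception {
        Searchable[] perDisk = {
            new IndexSearcher("/disk1/index"), // first SCSI disk (assumed mount)
            new IndexSearcher("/disk2/index")  // second SCSI disk (assumed mount)
        };
        MultiSearcher searcher = new MultiSearcher(perDisk);
        Query query = QueryParser.parse(args[0], "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}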
Re: Lucene optimization with one large index and numerous small indexes.
Don't want to start a buffer size war, but these have always seemed too small to me. I'd recommend upping both InputStream and OutputStream buffer sizes to at least 4k, as this is the cluster size on most disks these days, and also a common VM page size. Reading and writing in smaller quantities than these is definitely suboptimal.

Esmond Pitt
Lucene optimization with one large index and numerous small indexes.
We're using Lucene with one large target index which right now is 5G. Every night we take sub-indexes which are about 500M and merge them into this main index. This merge (done via IndexWriter.addIndexes(Directory[])) is taking way too much time. Looking at the stats for the box we're essentially blocked on reads: the disk is blocked on read IO and the CPU is at 5%.

If I'm right, I think this could be minimized by continually picking the two smallest indexes, merging them, then picking the next two smallest, merging them, and so on until we're down to one index. Does this sound about right?

Kevin
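The pairwise scheme described here might be sketched as follows; the size helper and paths are illustrative, and Doug's reply earlier in the thread argues that a single addIndexes() pass over all inputs is faster than this approach.

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Repeatedly merge the two smallest indexes until one remains, as
// described above. Illustrative only; see the single-pass advice.
public class PairwiseMerge {
    // Total bytes of the files in an index directory.
    static long sizeOf(String path) {
        long total = 0;
        File[] files = new File(path).listFiles();
        for (int i = 0; files != null && i < files.length; i++)
            total += files[i].length();
        return total;
    }

    public static void main(String[] args) throws Exception {
        List indexes = new ArrayList(Arrays.asList(args)); // index directory paths
        while (indexes.size() > 1) {
            Collections.sort(indexes, new Comparator() {   // smallest first
                public int compare(Object a, Object b) {
                    long diff = sizeOf((String) a) - sizeOf((String) b);
                    return diff < 0 ? -1 : diff > 0 ? 1 : 0;
                }
            });
            String smallest = (String) indexes.remove(0);
            String target = (String) indexes.get(0);       // next smallest absorbs it
            IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), false);
            writer.addIndexes(new Directory[] { FSDirectory.getDirectory(smallest, false) });
            writer.close();
        }
    }
}

Each pairwise pass rereads data that an earlier pass already wrote, which is exactly the log(N) overhead pointed out in the reply above.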