Matt, one important bit you didn't mention is your RAM buffer setting (setRAMBufferSizeMB, or the older setMaxBufferedDocs, on IndexWriter). If it's too low you will see lots of I/O. Increasing it means less I/O, but a larger JVM heap. Is your disk I/O caused by searches as well, or by indexing only?
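For illustration, a minimal sketch of turning that knob on a Lucene 2.3 IndexWriter; the index path, analyzer, and the 64 MB figure are placeholder assumptions, not recommendations:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class RamBufferSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder path and analyzer; adjust to your setup.
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),
                new StandardAnalyzer(), true);
        // Buffer more documents in RAM before flushing a segment:
        // fewer, larger flushes mean less disk I/O but a bigger heap.
        writer.setRAMBufferSizeMB(64.0);   // the 2.3 default is 16 MB
        writer.close();
    }
}

Flushing by RAM size rather than by document count (setMaxBufferedDocs) is usually the better lever when document sizes vary.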
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: mattspitz <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Saturday, August 16, 2008 4:07:52 AM
> Subject: Appropriate disk optimization for large index?
>
> Hi! I'm using Lucene 2.3.2 to store a relatively large index of HTML
> documents: ~150 million documents, taking up 150 GB of space.
>
> I index the HTML text, but I store only the primary-key information that
> lets me retrieve each document later. So the stored documents are small,
> and I imagine the inverted index itself accounts for almost all of the
> space.
>
> Because users search only their own documents, I split the data into ~2000
> indexes, so a given user's search touches just one of them. I also run
> queries that span all 2000 indexes.
>
> So: 2000 indexes full of small documents, but a relatively large amount of
> index data on disk.
>
> My question is what sort of disk to buy. Using "dstat", I've determined
> that the disk is clearly the bottleneck: nearly all the time spent indexing
> "chunks" of documents and committing them to disk is spent waiting on I/O.
> I spawn multiple threads across the various index writers to minimize I/O
> wait time, but the disk always ends up being the problem.
>
> Currently I have 7200 rpm SATA drives (RAID 0), but I also have 15k rpm
> SAS drives (also RAID 0) on hand.
>
> What is Lucene's access pattern when indexing documents, merging segments,
> and eventually optimizing them, given the document count and document size
> above? Am I better off with a drive that has a faster seek time, or should
> I optimize for sustained throughput? How does the way Lucene lays out
> indexes on disk affect this?
>
> If it helps, my merge factor is 50, and I use the compound file format
> because I run out of file descriptors otherwise.
>
> Thanks for your help,
> Matt
> --
> View this message in context:
> http://www.nabble.com/Appropriate-disk-optimization-for-large-index--tp19009580p19009580.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
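For reference, the writer configuration Matt describes (merge factor 50, compound file format) maps onto the 2.3 API roughly like this; the path and analyzer are placeholders:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class ShardWriterSketch {
    public static void main(String[] args) throws IOException {
        // One of the ~2000 per-user index shards; the path is a placeholder.
        IndexWriter writer = new IndexWriter("/path/to/shard",
                new StandardAnalyzer(), true);
        writer.setMergeFactor(50);        // merge 50 segments at a time:
                                          // fewer merges, each a long burst
                                          // of largely sequential I/O
        writer.setUseCompoundFile(true);  // pack each segment into one .cfs
                                          // file to conserve file descriptors
        writer.close();
    }
}

Since segment merges read and write files largely sequentially, index-heavy workloads like this generally benefit more from sustained throughput than from faster seek times.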