Vishal, re 2. - I don't think that's quite true; RAM is still much faster than SSDs.
Also, which version of Lucene are you using? Make sure you're using the latest one if you care about performance.

Also, if you have extra RAM, you can make your .tii bigger/denser and speed up searches that way.
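To make the .tii point concrete, here is a minimal sketch of what I mean, assuming a Lucene 2.x-era API (the index path and the interval value below are just placeholders). The term index interval is a write-time setting, so it only affects segments written or merged after you change it - in Nutch you would set it wherever the IndexWriter that builds your index gets created:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class DenseTiiSketch {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),  // placeholder path
          new StandardAnalyzer(),
          true,                                        // create a new index
          IndexWriter.MaxFieldLength.UNLIMITED);

      // Default interval is 128: every 128th term is recorded in the .tii.
      // A smaller interval (e.g. 32) makes the .tii bigger/denser, so a term
      // lookup scans a shorter stretch of the .tis file - at the cost of the
      // extra RAM the in-memory term index takes.
      writer.setTermIndexInterval(32);

      // ... addDocument() calls as usual, then:
      writer.close();
    }
  }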
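And re point 4 in the original message below - the per-field term plus phrase expansion - here is roughly that query shape as a small standalone sketch (again Lucene 2.x-era API; the field names and example words are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.TermQuery;

  public class QueryShapeSketch {
    static BooleanQuery build(String[] fields, String[] words) {
      BooleanQuery query = new BooleanQuery();
      for (String field : fields) {
        // One term query per word per field: fields x words clauses.
        for (String word : words) {
          query.add(new TermQuery(new Term(field, word)),
                    BooleanClause.Occur.SHOULD);
        }
        // Plus one phrase query per field containing all the words.
        PhraseQuery phrase = new PhraseQuery();
        for (String word : words) {
          phrase.add(new Term(field, word));
        }
        query.add(phrase, BooleanClause.Occur.SHOULD);
      }
      return query;
    }

    public static void main(String[] args) {
      // Made-up field names and words: with 8 fields and 3 words this
      // becomes 24 term clauses + 8 phrase clauses, as in the question.
      String[] fields = {"title", "content", "anchor", "url"};
      String[] words  = {"large", "index", "performance"};
      System.out.println(build(fields, words));
    }
  }

The phrase clauses are usually the expensive part, especially when they hit common terms, which would line up with what you describe in point 5.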
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: VishalS <vish...@rediff.co.in>
> To: VishalS <vish...@rediff.co.in>; nutch-user@lucene.apache.org
> Sent: Monday, January 12, 2009 6:49:58 AM
> Subject: RE: Search performance for large indexes (>100M docs)
>
> Hi,
>
> Thanks for the responses - I have received replies from Otis, Dennis, Sean and Jay Pound (sorry if I forgot someone). To summarize what I understood from these replies:
>
> 1. The indices *have* to be in fast storage - it's difficult to get great performance without this.
> 2. It's worth looking into SSDs to store the indices. This would probably help speed up search performance, is cheaper than RAM, and gives almost comparable performance.
> 3. Jay mentioned that with Nutch 0.7, the hard drives were a bottleneck for him. He got around the issue by using multiple (~15) small hard drives on a single machine and running 10 search servers on it - he was able to get reasonable performance from this architecture on 2 machines.
>
> Currently, I am unable to experiment with SSDs since my searchers are hosted on EC2.
>
> From my experience so far, I am also leaning towards the belief that the query plugins play a very important role in performance (apart from relevance).
>
> I will share my observations as I keep going.
>
> Sean - good luck with the experiments you are conducting - way to go!
>
> Regards,
>
> -vishal.
>
> _____
>
> From: VishalS [mailto:vish...@rediff.co.in]
> Sent: Tuesday, January 06, 2009 7:12 PM
> To: 'nutch-user@lucene.apache.org'
> Subject: Search performance for large indexes (>100M docs)
>
> Hi,
>
> I am experimenting with a system of around 120 million documents. The index is split into sub-indices of ~10M documents - each such index is searched by a single machine. The results are aggregated using the DistributedSearcher client. I am seeing a lot of performance issues with the system - most of the time, response times are >4 seconds, and in some cases they go up to a minute.
>
> It would be wonderful to know if there are ways to optimize what I am doing, or if there is something obvious that I am doing wrong. Here's what I have tried so far, and the issues I see:
>
> 1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4 CPUs running Linux. However, the searcher is not able to use more than 1 GB of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a Lucene issue. Is there a way we can have the searcher use more RAM to speed things up?
> 2. The total size of the index directory on each machine is ~70-100 GB. The prx file is 50GB, the fnm and frq files are ~27GB each, and the fdt file is around 3GB. Is this too big?
> 3. I have tried analyzing my documents for commonly occurring terms in various fields, and added these terms to common-terms.utf8. There are ~10K terms in this file for me now. I am hoping this will help speed up any phrase queries I am doing internally (although there is a cost in terms of the number of unique terms in the Lucene index - the total index size has increased by ~10-15%, which I guess is OK).
> 4. There are around 8 fields that are searched for each of the words in the query. Also, a phrase query containing all the words is fired in each of these fields as well. This means that for a 3-word input query, the sub-queries in my Lucene query are 24 (3*8) term queries and 8 (1*8) 3-word phrase queries. Is this too long or too expensive?
> 5. I have noticed that the slowest-running queries (it takes up to a minute sometimes) are often the ones that have one or more common words.
> 6. Each individual searcher has a single Lucene indexlet. Would it be faster to have more than one indexlet on the machine?
> 7. I am using a Tomcat 6.0 installation out of the box, with some minor changes to the number of threads and the Java stack size allocation.
>
> If there's anyone else who has had experience working with large indices, I would love to get in touch and exchange notes.
>
> Regards,
>
> -Vishal.