Hi Dennis, I don't follow this group in real time, so I know this is a late reply; if you reply to me, please CC me directly.
I've had good luck with Nutch and using tons of memory; I went well past 3 GB. To be fair, I don't know how much of that was Nutch spidering vs. Lucene indexing:
http://www.enterprisesearchblog.com/2009/01/virtualization-and-search-performance-tests-summary.html

The only problem I can think of is that, to use that much memory, all three components have to be 64-bit: a 64-bit chip, a 64-bit OS, and a 64-bit JVM. I imagine you know this already, just triple checking.

I was using a stock Sun JVM, the 64-bit JVM for Windows. I suppose that could be a difference; maybe they don't provide a 64-bit JVM for Linux? Or maybe you were using somebody else's JVM?

Below your quoted message I've pasted two rough sketches: a tiny class for sanity-checking the 64-bit/heap setup, and an illustration of what your expanded multi-field query (your point 4) looks like at the Lucene level.

Mark

PS: Reminder, if you reply to me, please CC me.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Tue, Jan 6, 2009 at 5:41 AM, VishalS <vish...@rediff.co.in> wrote:
> Hi,
>
> I am experimenting with a system of around 120 million documents. The
> index is split into sub-indices of ~10M documents, and each such index is
> searched by a single machine. The results are aggregated using the
> DistributedSearcher client. I am seeing a lot of performance issues with
> the system: most of the time the response times are >4 seconds, and in
> some cases they go up to a minute.
>
> It would be wonderful to know if there are ways to optimize what I am
> doing, or if there is something obvious that I am doing wrong. Here's what
> I have tried so far, and the issues I see:
>
> 1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
> CPUs running Linux. However, the searcher is not able to use more than
> 1 GB of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is
> a Lucene issue. Is there a way we can have the searcher use more RAM to
> speed things up?
> 2. The total size of the index directory on each machine is ~70-100 GB.
> The prx file is 50GB, the fnm and frq files are ~27GB each, and the fdt
> file is around 3GB. Is this too big?
> 3. I have analyzed my documents for commonly occurring terms in various
> fields and added those terms to common-terms.utf8; there are ~10K terms
> in that file now. I am hoping this will speed up the phrase queries I am
> doing internally (there is a cost in the number of unique terms in the
> Lucene index, and the total index size has increased by ~10-15%, which I
> guess is OK).
> 4. Around 8 fields are searched for each word in the query, and a phrase
> query containing all the words is fired in each of these fields as well.
> This means that for a 3-word input query, my Lucene query contains
> 24 (3*8) term queries and 8 (1*8) 3-word phrase queries. Is this too long
> or too expensive?
> 5. I have noticed that the slowest queries (sometimes taking up to a
> minute) are often the ones that contain one or more common words.
> 6. Each individual searcher has a single Lucene indexlet. Would it be
> faster to have more than one indexlet per machine?
> 7. I am using a Tomcat 6.0 installation out of the box, with some minor
> changes to the number of threads and the Java stack size allocation.
>
> If anyone else has experience working with large indices, I would love to
> get in touch and exchange notes.
>
> Regards,
>
> -Vishal.
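
Here's the 64-bit sanity check I mentioned above. It's a minimal sketch, not anything from Nutch or Lucene; the class name is made up, and sun.arch.data.model is a Sun-JVM-specific property, so treat it as an illustration only.

// HeapCheck.java -- hypothetical helper, not part of Nutch or Lucene.
// Prints the JVM word size and the maximum heap this JVM will actually use.
public class HeapCheck {
    public static void main(String[] args) {
        // "64" on a 64-bit Sun JVM, "32" on a 32-bit one (Sun-specific property).
        System.out.println("sun.arch.data.model = "
                + System.getProperty("sun.arch.data.model"));
        // e.g. "amd64"/"x86_64" for a 64-bit JVM, "x86"/"i386" for 32-bit.
        System.out.println("os.arch             = "
                + System.getProperty("os.arch"));
        // Should roughly match the -Xmx you pass; if it doesn't, the flag is
        // not reaching this JVM (common when Tomcat is launched by a wrapper
        // script and -Xmx isn't in JAVA_OPTS/CATALINA_OPTS).
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap (MB)       = " + maxMb);
    }
}

Run it with the same flags you give the searcher, e.g. "java -Xmx3500m HeapCheck". If the word size comes back 32, or the reported max heap sits near 1 GB despite the flag, the ceiling is the JVM setup rather than Lucene.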
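And here's roughly what I understand your point 4 to mean, written against the Lucene 2.x-era query API. The field names and sample words are made up; the point is just that 3 words over 8 fields becomes a 32-clause BooleanQuery, and each of the 8 phrase clauses has to read term positions out of that 50 GB .prx file.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryFanout {
    // Hypothetical field list -- substitute whatever your 8 fields really are.
    private static final String[] FIELDS = {
        "title", "content", "anchor", "url", "host", "heading", "meta", "site"
    };

    public static Query build(String[] words) {
        BooleanQuery top = new BooleanQuery();
        for (String field : FIELDS) {
            // One TermQuery per (word, field): 3 words x 8 fields = 24 clauses.
            for (String word : words) {
                top.add(new TermQuery(new Term(field, word)),
                        BooleanClause.Occur.SHOULD);
            }
            // One phrase per field over all the words: 8 more clauses, each
            // needing position data (the .prx file) to match and score.
            PhraseQuery phrase = new PhraseQuery();
            for (String word : words) {
                phrase.add(new Term(field, word));
            }
            top.add(phrase, BooleanClause.Occur.SHOULD);
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(build(new String[] { "large", "index", "tuning" }));
    }
}

The phrase clauses are also why your common-word queries are the slow ones: a phrase containing "the" or "of" has to walk the full position list for that term in every field. If I understand the common-terms.utf8 mechanism correctly, listing those terms makes Nutch index them as bigrams with their neighbors, which trims exactly that cost, so your point 3 is attacking the right problem.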