Hi,
I am experimenting with a system with around 120 million documents. The
index is split into sub-indices of ~10M documents, and each sub-index is
searched by a single machine. The results are aggregated using the
DistributedSearcher client. I am seeing a lot of performance issues with
the system: most of the time the response times are >4 seconds, and in
some cases they go up to a minute.
It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's
what I have tried so far, and the issues I see:
1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more than
1 GB of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is
a Lucene issue. Is there a way we can have the searcher use more RAM to
speed things up?
2. The total size of the index directory on each machine is ~70-100 GB.
The prx file is 50GB, the fnm and frq files are ~27GB each and the fdt
file is around 3GB. Is this too big?
3. I have tried analyzing my documents for commonly occurring terms in
various fields, and added these terms to common-terms.utf8. There are
now ~10K terms in this file. I am hoping this will speed up any phrase
queries I am doing internally. This comes at a cost in the number of
unique terms in the Lucene index, but the total index size has increased
by only ~10-15%, which I guess is OK.
4. Each word in the query is searched in around 8 fields. A phrase query
containing all the words is also fired against each of these fields.
This means that for a 3-word input query, my Lucene query contains 24
(3*8) term queries and 8 (1*8) 3-word phrase queries. Is this too long
or too expensive?
5. I have noticed that the slowest queries (sometimes taking up to a
minute) are often the ones that contain one or more common words.
6. Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than 1 indexlet on the machine?
7. I am using an out-of-the-box Tomcat 6.0 installation, with some minor
changes to the number of threads and the Java stack size allocation.
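
On point 1, here is a minimal sketch (plain Java, nothing Lucene-specific)
for checking how much heap the JVM actually received. The class name
HeapCheck and the helper toMb are only illustrations. Inside Tomcat the
same Runtime calls would have to be logged from the webapp itself, since
-Xmx only takes effect if it reaches the Tomcat JVM (for example through
JAVA_OPTS or CATALINA_OPTS).

```java
public class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx ceiling the JVM was started with.
        System.out.println("max heap (MB):  " + toMb(rt.maxMemory()));
        // totalMemory() is the heap currently reserved from the OS.
        System.out.println("committed (MB): " + toMb(rt.totalMemory()));
        // Reserved minus free gives the heap in use right now.
        System.out.println("used (MB):      "
                + toMb(rt.totalMemory() - rt.freeMemory()));
    }
}
```

If "max heap" reports ~3.5GB but "used" stays near 1GB, the setting is
being applied and the searcher simply is not allocating more on the heap.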
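
To make the query shape in point 4 concrete, here is a rough sketch in
plain Java of how the clauses multiply. It does not use the Lucene API;
it only builds clause strings. The field names and the sample words are
placeholders, not my real schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryExpansionSketch {
    // Builds one term clause per (field, word) pair and one phrase
    // clause per field, mirroring the expansion described in point 4.
    static List<String> expand(List<String> words, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            for (String word : words) {
                clauses.add(field + ":" + word);
            }
            clauses.add(field + ":\"" + String.join(" ", words) + "\"");
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Hypothetical input query and field names.
        List<String> words = Arrays.asList("cheap", "flight", "tickets");
        List<String> fields = Arrays.asList(
                "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8");
        List<String> clauses = expand(words, fields);
        // 8 fields * (3 term clauses + 1 phrase clause) = 32 clauses
        System.out.println(clauses.size() + " clauses");
    }
}
```

The clause count grows as fields * (words + 1), so every extra query word
adds 8 more term clauses and lengthens all 8 phrase clauses.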
If there's anyone else who has had experience working with large indices,
I would love to get in touch and exchange notes.
Regards,
-Vishal.