From: VishalS <vish...@rediff.co.in>
To: VishalS <vish...@rediff.co.in>; nutch-user@lucene.apache.org
Sent: Monday, January 12, 2009 6:49:58 AM
Subject: RE: Search performance for large indexes (>100M docs)
Hi,
Thanks for the responses - I have received replies from Otis, Dennis,
Sean and Jay Pound (sorry if I forgot someone). To summarize what I
understood from these replies:
1. The indices *have* to be in fast storage - it's difficult to get
great performance without this.
2. It's worth looking into SSDs for storing the indices. They would
probably help speed up search, are cheaper than RAM, and give almost
similar performance.
3. Jay mentioned that with Nutch 0.7, the hard drives were a
bottleneck for him. He got around the issue by using multiple (~15)
small hard drives on a single machine and running 10 search servers
on it - he was able to get reasonable performance from this
architecture on 2 machines.
Currently, I am unable to experiment with SSDs since my searchers are
hosted on EC2.
From my experience so far, I am also leaning towards believing that
the query plugins play a very important role in performance (apart
from relevance).
I will share my observations as I keep going.
Sean - good luck with the experiments you are conducting - way to go!
Regards,
-vishal.
_____
From: VishalS [mailto:vish...@rediff.co.in]
Sent: Tuesday, January 06, 2009 7:12 PM
To: 'nutch-user@lucene.apache.org'
Subject: Search performance for large indexes (>100M docs)
Hi,
I am experimenting with a system with around 120 million documents.
The index is split into sub-indices of ~10M documents - each such
index is searched by a single machine (so there are roughly 12
searchers), and the results are aggregated using the
DistributedSearcher client. I am seeing a lot of performance issues
with the system - most of the time, the response times are >4
seconds, and in some cases they go up to a minute.
It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's
what I have tried so far, and the issues I see:
1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more
than 1GB of RAM even though I have set -Xmx to ~3.5GB (how I pass the
flag is sketched below). I am guessing this is a Lucene issue. Is
there a way we can have the searcher use more RAM to speed things up?
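For reference, here is how I am passing the flag - a sketch assuming
the stock Tomcat 6 scripts (setenv.sh is the standard hook that
catalina.sh sources; jps -lvm shows which flags the running JVM
actually received; the values are just the ones described above):

    # $CATALINA_HOME/bin/setenv.sh (sourced by catalina.sh at startup)
    export CATALINA_OPTS="-Xms3500m -Xmx3500m"

    # verify the flags actually reached the Tomcat JVM
    jps -lvm | grep Bootstrap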
2. The total size of the index directory on each machine is ~70-100
GB. The prx file is 50GB, the fnm and frq files are ~27GB each, and
the fdt file is around 3GB. Is this too big?
3. I have tried analyzing my documents for commonly occurring terms
in various fields, and added these terms to common-terms.utf8. There
are ~10K terms in this file for me now. I am hoping this will help
speed up any phrase queries I am doing internally (the idea is
sketched below). There is a cost attached in terms of the number of
unique terms in the Lucene index - the total index size has increased
by ~10-15%, which I guess is ok.
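To make the intent concrete, here is a toy sketch of the technique I
am relying on (an illustration of the common-grams idea only, not
Nutch's actual filter code): adjacent tokens are fused into a single
bigram token whenever one of them is a common term, so a phrase over
common words becomes a cheap single-term lookup instead of a walk
over a huge posting list.

    // Toy sketch of the common-grams idea (illustration only, not
    // Nutch's actual filter): fuse a token with its neighbour into a
    // bigram whenever either one is a common term.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class CommonGramsToy {
      static final Set<String> COMMON =
          new HashSet<String>(Arrays.asList("the", "of", "a"));

      static List<String> emit(String[] tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
          out.add(tokens[i]);  // always keep the plain unigram
          if (i + 1 < tokens.length
              && (COMMON.contains(tokens[i]) || COMMON.contains(tokens[i + 1]))) {
            out.add(tokens[i] + "-" + tokens[i + 1]);  // fused bigram token
          }
        }
        return out;
      }

      public static void main(String[] args) {
        // prints [the, the-quick, quick, fox]
        System.out.println(emit(new String[] {"the", "quick", "fox"}));
      }
    }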
4. There are around 8 fields that are searched for each of the words
in the query. Also, a phrase query containing all the words is fired
in each of these fields. This means that for a 3-word input query, my
Lucene query contains 24 (3*8) term queries and 8 (1*8) 3-word phrase
queries (see the sketch below). Is this too long or too expensive?
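For concreteness, the query shape looks roughly like this when built
by hand with the Lucene API (field names are made up; the real query
is assembled by the Nutch query plugins, but the clause count is the
same):

    // Sketch of the query shape: words*fields term clauses plus one
    // phrase clause per field, e.g. 3 words x 8 fields = 24 + 8 clauses.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class QueryShape {
      public static BooleanQuery build(String[] words, String[] fields) {
        BooleanQuery top = new BooleanQuery();
        for (String field : fields) {
          for (String word : words) {  // one TermQuery per (word, field)
            top.add(new TermQuery(new Term(field, word)),
                    BooleanClause.Occur.SHOULD);
          }
          PhraseQuery phrase = new PhraseQuery();  // one phrase per field
          for (String word : words) {
            phrase.add(new Term(field, word));
          }
          top.add(phrase, BooleanClause.Occur.SHOULD);
        }
        return top;
      }
    }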
5. I have noticed that the slowest-running queries (they sometimes
take up to a minute) are often the ones that contain one or more
common words.
6. Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than one indexlet on the machine (set up as
sketched below)?
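If I do try it, my understanding is that it would just mean running
several search servers on different ports of the same box and listing
them all in search-servers.txt - something along these lines
(hostnames, ports and paths made up):

    # search-servers.txt on the web-app side: one "host port" pair
    # per line; two entries for one host = two searchers on that box
    searcher1 9990
    searcher1 9991
    searcher2 9990

    # on searcher1, one server per port/index
    bin/nutch server 9990 /indexes/part-0
    bin/nutch server 9991 /indexes/part-1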
7. I am using a Tomcat 6.0 installation out-of-the-box, with some
minor changes to the number of threads and the Java stack-size
allocation (shown below).
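Concretely, the only knobs I have touched are along these lines
(values illustrative):

    <!-- conf/server.xml: a bigger connector thread pool -->
    <Connector port="8080" protocol="HTTP/1.1" maxThreads="400" />

    # and a smaller per-thread stack so the extra threads fit in memory
    export CATALINA_OPTS="$CATALINA_OPTS -Xss256k"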
If there's anyone else who has had experience working with large
indices, I would love to get in touch and exchange notes.
Regards,
-Vishal.