Re: Search performance for large indexes (>100M docs)

Dennis Kubes Fri, 09 Jan 2009 12:59:42 -0800

Essentially you would create a tmpfs (ramdisk) and put the indexes in the tmpfs. Assuming your indexes were in a folder called indexes.dist, you would use something like this:


mount -t tmpfs -o size=7516192768 none /your/indexes
rsync --progress -aptv /your/indexes.dist/* /your/indexes/
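
The size= value above is 7 GiB expressed in bytes. A minimal sketch of how one might derive that number and check that the on-disk index fits before copying; the helper names gib_to_bytes and fits_in_tmpfs are hypothetical, and `du -sb` assumes GNU coreutils:

```shell
#!/bin/sh
# Hypothetical helper: convert whole GiB to bytes for tmpfs's size= option.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: succeed only if directory $1 is no larger than $2 bytes.
# du -sb (GNU coreutils) reports the apparent size in bytes.
fits_in_tmpfs() {
    index_bytes=$(du -sb "$1" | cut -f1)
    [ "$index_bytes" -le "$2" ]
}

gib_to_bytes 7   # prints 7516192768

# Example use with the placeholder paths from the commands above:
# fits_in_tmpfs /your/indexes.dist "$(gib_to_bytes 7)" || echo "index too large"
```

Keep in mind that a tmpfs is backed by RAM (and swap), so the size you pick has to leave room for the searcher's own heap.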

You will also want to check the mailing list for indexes in RAM. I believe I posted a much more detailed set of instructions before.


Dennis


ianwong wrote:

Can you tell me how to keep indexes in RAM in the Nutch query server if the
client side uses DistributedSearcher?

Thanks
Ian



Dennis Kubes-2 wrote:

Take a look on the mailing lists for keeping the indexes in memory. When you get to the sizes you are talking about, the way you get subsecond response times is by:


1) Keeping the indexes in RAM
2) Aggressive caching

Dennis

VishalS wrote:

Hi,

  I am experimenting with a system with around 120 million documents. The
index is split into sub-indices of ~10M documents, and each such index is
searched by a single machine. The results are aggregated using the
DistributedSearcher client. I am seeing a lot of performance issues with
the system: most of the time the response times are >4 seconds, and in
some cases they go up to a minute.

  It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's
what I have tried so far, and the issues I see:

1.      Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more than 1 GB
of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a
Lucene issue. Is there a way we can have the searcher use more RAM to speed
things up?
2.      The total size of the index directory on each machine is ~70-100 GB.
The prx file is 50GB, the fnm and frq files are ~27GB each and the fdt file
is around 3GB. Is this too big?
3.      I have tried analyzing my documents for commonly occurring terms in
various fields, and added these terms to common-terms.utf8. There are ~10K
terms in this file for me now. I am hoping this will help me speed up any
phrase queries I am doing internally (although there is a cost attached in
terms of the number of unique terms in the Lucene index, the total index
size has increased by ~10-15%, which I guess is ok.)
4.      There are around 8 fields that are searched in for each of the words
in the query. Also, a phrase query containing all the words is fired in each
of these fields as well. This means that for a 3 word input query, the
number of sub-queries in my Lucene query are 24(3*8) term queries and 8(1*8)
3-word phrase queries. Is this too long or too expensive?
5.      I have noticed that the slowest running queries (they take up to a
minute sometimes) are often the ones that have one or more common words.
6.      Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than 1 indexlet on the machine?
7.      I am using a tomcat 6.0 installation out-of-the-box, with some minor
changes to the number of threads and the Java stack size allocation.
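
The arithmetic in item 4 can be sketched as follows, to show how the number of sub-queries grows with query length. The function name count_subqueries is hypothetical and not part of Nutch or Lucene; it only assumes one term query per word per field plus one phrase query per field:

```shell
#!/bin/sh
# Hypothetical helper: count sub-queries when every word is searched in
# every field and one phrase query over all the words is added per field.
count_subqueries() {
    words=$1
    fields=$2
    term_queries=$(( words * fields ))
    phrase_queries=$fields
    total=$(( term_queries + phrase_queries ))
    echo "$term_queries term queries, $phrase_queries phrase queries, $total total"
}

count_subqueries 3 8   # prints: 24 term queries, 8 phrase queries, 32 total
```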

If there's anyone else who has had experience working with large indices, I
would love to get in touch and exchange notes.

Regards,

-Vishal.
