buddha1021 wrote:
hi dennis:
in your opinion, which is the most important reason for Google's fast search speed:
1. Google's program (code) is excellent, or

Yes. They are performance fanatics (literally). But there is only so much you can optimize code, even if it is written in assembly.

2. Google puts all the indexes into RAM.

Yes. They would have to. I don't see another way. Not saying there isn't one, but I haven't found it yet.

Which is the most important reason?

I think having the indexes in RAM is the most important factor, but having a large caching layer in front of the index is also important. Having a supplemental index is another key factor, IMO.
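For illustration, here is a minimal sketch (not anything Nutch ships with) of pulling an existing on-disk Lucene index into RAM with RAMDirectory and searching it; the index path and field name are made up, and it assumes the whole index fits in the JVM heap:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexExample {
  public static void main(String[] args) throws Exception {
    // Copy the on-disk index into the heap; needs enough -Xmx headroom.
    RAMDirectory ramDir = new RAMDirectory(FSDirectory.getDirectory("/path/to/index"));

    // Searches now read from memory instead of disk.
    IndexSearcher searcher = new IndexSearcher(ramDir);
    Hits hits = searcher.search(new TermQuery(new Term("content", "nutch")));
    System.out.println("hits: " + hits.length());
    searcher.close();
  }
}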


And, if Nutch put all the indexes into RAM, could Nutch's search speed be as fast as Google's?

Yes, I definitely think it is possible to get Google-like speed with Nutch. Check out www.visvo.com or search.wikia.com. Both use in-memory indexes. It is not something you would just deploy and have out of the box, though. Google is as fast as it is because they have built out every step efficiently, from the code to the operations to the bandwidth and DNS.

Dennis



Dennis Kubes-2 wrote:
Take a look at the mailing lists for threads on keeping the indexes in memory. When you get to the sizes you are talking about, the way you get sub-second response times is by:

1) Keeping the indexes in RAM
2) Aggressive caching
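For the caching side, here is a rough sketch of a query-result cache kept in front of the searchers, built on an access-ordered LinkedHashMap; the cache size and the idea of keying on the raw query string are just assumptions for illustration:

import java.util.LinkedHashMap;
import java.util.Map;

// Small LRU cache keyed on the raw query string.
public class QueryResultCache<V> {
  private static final int MAX_ENTRIES = 10000; // made-up size; tune to available RAM

  private final Map<String, V> cache =
      new LinkedHashMap<String, V>(MAX_ENTRIES, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
          // Evict the least recently used entry once the cache is full.
          return size() > MAX_ENTRIES;
        }
      };

  public synchronized V get(String query) {
    return cache.get(query);
  }

  public synchronized void put(String query, V results) {
    cache.put(query, results);
  }
}

The idea is simply to check the cache before hitting the distributed searchers and to populate it on the way back; since queries repeat in real traffic, popular queries can be answered without touching the index at all.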

Dennis

VishalS wrote:
Hi,

  I am experimenting with a system with around 120 million documents. The index is split into sub-indices of ~10M documents, and each such index is searched by a single machine. The results are aggregated using the DistributedSearcher client. I am seeing a lot of performance issues with the system - most of the time the response times are >4 seconds, and in some cases they go up to a minute.

  It would be wonderful to know if there are ways to optimize what I am doing, or if there is something obvious that I am doing wrong. Here's what I have tried so far, and the issues I see:

1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4 CPUs running Linux. However, the searcher is not able to use more than 1 GB of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a Lucene issue. Is there a way we can have the searcher use more RAM to speed things up?
2. The total size of the index directory on each machine is ~70-100 GB. The prx file is 50GB, the fnm and frq files are ~27GB each, and the fdt file is around 3GB. Is this too big?
3. I have tried analyzing my documents for commonly occurring terms in various fields and added these terms to common-terms.utf8. There are ~10K terms in this file for me now. I am hoping this will help speed up any phrase queries I am doing internally (although there is a cost in terms of the number of unique terms in the Lucene index; the total index size has increased by ~10-15%, which I guess is ok).
4. There are around 8 fields that are searched for each of the words in the query. Also, a phrase query containing all the words is fired in each of these fields. This means that for a 3-word input query, my Lucene query contains 24 (3*8) term queries and 8 (1*8) 3-word phrase queries (see the sketch after this list). Is this too long or too expensive?
5. I have noticed that the slowest running queries (they sometimes take up to a minute) are often the ones that have one or more common words.
6. Each individual searcher has a single Lucene indexlet. Would it be faster to have more than one indexlet on the machine?
7. I am using a Tomcat 6.0 installation out of the box, with some minor changes to the number of threads and the Java stack size allocation.
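To make the query shape from point 4 concrete, here is a rough sketch of how such a query might be assembled with the Lucene API; the words, field names, and the use of SHOULD clauses are made up for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QueryShapeExample {
  public static BooleanQuery build(String[] words, String[] fields) {
    BooleanQuery query = new BooleanQuery();

    // One TermQuery per (word, field) pair: 3 words x 8 fields = 24 clauses.
    for (String field : fields) {
      for (String word : words) {
        query.add(new TermQuery(new Term(field, word)), BooleanClause.Occur.SHOULD);
      }
    }

    // One PhraseQuery per field containing all the words: 8 more clauses.
    for (String field : fields) {
      PhraseQuery phrase = new PhraseQuery();
      for (String word : words) {
        phrase.add(new Term(field, word));
      }
      query.add(phrase, BooleanClause.Occur.SHOULD);
    }
    return query;
  }

  public static void main(String[] args) {
    String[] words = { "open", "source", "search" };
    String[] fields = { "title", "content", "anchor", "url",
                        "host", "site", "meta", "headings" };
    System.out.println(build(words, fields));  // 32 clauses total
  }
}

The phrase clauses are the ones that have to read position data from the large prx file, which I suspect is part of why the queries containing common words are the slowest (point 5).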
If there's anyone else who has had experience working with large indices, I would love to get in touch and exchange notes.

Regards,

-Vishal.



