From: VishalS <vish...@rediff.co.in>
To: VishalS <vish...@rediff.co.in>; nutch-user@lucene.apache.org
Sent: Monday, January 12, 2009 6:49:58 AM
Subject: RE: Search performance for large indexes (>100M docs)
Hi,
Thanks for the responses - I have received replies from Otis, Dennis,
Sean and Jay Pound (sorry if I forgot someone). To summarize what I
understood from these replies:
1. The indices *have* to be in fast storage - it's difficult to get
great performance without this.
2. It's worth looking into SSDs for storing the indices. They would
probably help speed up search, are cheaper than RAM, and give almost
similar performance.
3. Jay mentioned that with Nutch 0.7, the hard drives were a
bottleneck for him. He got around the issue by using multiple (~15)
small hard drives on a single machine and running 10 search servers
on it - he was able to get reasonable performance from this
architecture on 2 machines.
Currently, I am unable to experiment with SSDs since my searchers are
hosted on EC2.
From my experience so far, I am also leaning towards believing that
the query plugins play a very important role in performance (apart
from relevance).
I will share my observations as I keep going.
Sean - good luck with the experiments you are conducting - way to go!
Regards,
-vishal.
_____
From: VishalS [mailto:vish...@rediff.co.in]
Sent: Tuesday, January 06, 2009 7:12 PM
To: 'nutch-user@lucene.apache.org'
Subject: Search performance for large indexes (>100M docs)
Hi,
I am experimenting with a system with around 120 million documents.
The index is split into sub-indices of ~10M documents - each such
index is searched by a single machine (so there are roughly 12
searchers), and the results are aggregated using the
DistributedSearcher client. I am seeing a lot of performance issues
with the system - most of the time, the response times are >4
seconds, and in some cases they go up to a minute.
It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's
what I have tried so far, and the issues I see:
1. Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more
than 1GB of RAM even though I have set -Xmx to ~3.5GB (how I pass the
flag is sketched below). I am guessing this is a Lucene issue. Is
there a way we can have the searcher use more RAM to speed things up?
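For reference, here is how I am passing the flag - a sketch assuming
the stock Tomcat 6 scripts (setenv.sh is the standard hook that
catalina.sh sources; jps -lvm shows which flags the running JVM
actually received; the values are just the ones described above):

    # $CATALINA_HOME/bin/setenv.sh (sourced by catalina.sh at startup)
    export CATALINA_OPTS="-Xms3500m -Xmx3500m"

    # verify the flags actually reached the Tomcat JVM
    jps -lvm | grep Bootstrap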
2. The total size of the index directory on each machine is ~70-100
GB. The prx file is 50GB, the fnm and frq files are ~27GB each, and
the fdt file is around 3GB. Is this too big?
3. I have tried analyzing my documents for commonly occurring terms
in various fields, and added these terms to common-terms.utf8. There
are ~10K terms in this file for me now. I am hoping this will help
speed up any phrase queries I am doing internally (the idea is
sketched below). There is a cost attached in terms of the number of
unique terms in the Lucene index - the total index size has increased
by ~10-15%, which I guess is ok.
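To make the intent concrete, here is a toy sketch of the technique I
am relying on (an illustration of the common-grams idea only, not
Nutch's actual filter code): adjacent tokens are fused into a single
bigram token whenever one of them is a common term, so a phrase over
common words becomes a cheap single-term lookup instead of a walk
over a huge posting list.

    // Toy sketch of the common-grams idea (illustration only, not
    // Nutch's actual filter): fuse a token with its neighbour into a
    // bigram whenever either one is a common term.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class CommonGramsToy {
      static final Set<String> COMMON =
          new HashSet<String>(Arrays.asList("the", "of", "a"));

      static List<String> emit(String[] tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
          out.add(tokens[i]);  // always keep the plain unigram
          if (i + 1 < tokens.length
              && (COMMON.contains(tokens[i]) || COMMON.contains(tokens[i + 1]))) {
            out.add(tokens[i] + "-" + tokens[i + 1]);  // fused bigram token
          }
        }
        return out;
      }

      public static void main(String[] args) {
        // prints [the, the-quick, quick, fox]
        System.out.println(emit(new String[] {"the", "quick", "fox"}));
      }
    }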
4. There are around 8 fields that are searched for each of the words
in the query. Also, a phrase query containing all the words is fired
in each of these fields. This means that for a 3-word input query, my
Lucene query contains 24 (3*8) term queries and 8 (1*8) 3-word phrase
queries (see the sketch below). Is this too long or too expensive?
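For concreteness, the query shape looks roughly like this when built
by hand with the Lucene API (field names are made up; the real query
is assembled by the Nutch query plugins, but the clause count is the
same):

    // Sketch of the query shape: words*fields term clauses plus one
    // phrase clause per field, e.g. 3 words x 8 fields = 24 + 8 clauses.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class QueryShape {
      public static BooleanQuery build(String[] words, String[] fields) {
        BooleanQuery top = new BooleanQuery();
        for (String field : fields) {
          for (String word : words) {  // one TermQuery per (word, field)
            top.add(new TermQuery(new Term(field, word)),
                    BooleanClause.Occur.SHOULD);
          }
          PhraseQuery phrase = new PhraseQuery();  // one phrase per field
          for (String word : words) {
            phrase.add(new Term(field, word));
          }
          top.add(phrase, BooleanClause.Occur.SHOULD);
        }
        return top;
      }
    }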
5. I have noticed that the slowest-running queries (they sometimes
take up to a minute) are often the ones that contain one or more
common words.
6. Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than one indexlet on the machine (set up as
sketched below)?
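If I do try it, my understanding is that it would just mean running
several search servers on different ports of the same box and listing
them all in search-servers.txt - something along these lines
(hostnames, ports and paths made up):

    # search-servers.txt on the web-app side: one "host port" pair
    # per line; two entries for one host = two searchers on that box
    searcher1 9990
    searcher1 9991
    searcher2 9990

    # on searcher1, one server per port/index
    bin/nutch server 9990 /indexes/part-0
    bin/nutch server 9991 /indexes/part-1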
7. I am using a Tomcat 6.0 installation out-of-the-box, with some
minor changes to the number of threads and the Java stack-size
allocation (shown below).
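Concretely, the only knobs I have touched are along these lines
(values illustrative):

    <!-- conf/server.xml: a bigger connector thread pool -->
    <Connector port="8080" protocol="HTTP/1.1" maxThreads="400" />

    # and a smaller per-thread stack so the extra threads fit in memory
    export CATALINA_OPTS="$CATALINA_OPTS -Xss256k"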
If there's anyone else who has had experience working with large
indices, I would love to get in touch and exchange notes.
Regards,
-Vishal.