Hi Dennis,

I don't follow this group in real time, so I know this is a late reply; if
you reply to me, please CC me directly.

I've had good luck with Nutch and using tons of memory; I went well past 3
GB.  To be fair, I don't know how much of that was Nutch spidering vs. Lucene
indexing.
http://www.enterprisesearchblog.com/2009/01/virtualization-and-search-performance-tests-summary.html

The only problem I can think of is that, to use that much memory, all three
components need to be 64-bit: 64-bit chip, 64-bit OS, and 64-bit JVM.
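
If it helps on the memory front, here's a throwaway check I'd run once on
each search box.  This is just a sketch (the class name is made up, and
sun.arch.data.model is specific to Sun's JVM), but it tells you in one shot
whether the JVM is really 64-bit and whether your -Xmx setting is actually
reaching it:

// Quick sanity check: is this JVM 64-bit, and did -Xmx take effect?
public class JvmCheck {
    public static void main(String[] args) {
        // "amd64" / "x86_64" suggests 64-bit; "i386" / "x86" means 32-bit
        System.out.println("os.arch             = " + System.getProperty("os.arch"));
        // Sun JVMs report "32" or "64" here; other vendors may return null
        System.out.println("sun.arch.data.model = " + System.getProperty("sun.arch.data.model"));
        System.out.println("java.vm.name        = " + System.getProperty("java.vm.name"));
        // Should be close to the -Xmx value you passed in; if it's stuck
        // near 1 GB, the flag isn't reaching the searcher's JVM at all.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap            = ~" + maxMb + " MB");
    }
}

Run it with the same flags you give Tomcat (e.g. java -Xmx3500m JvmCheck);
a 32-bit JVM will typically either refuse to start with a heap that large or
report a max heap well below what you asked for.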

I imagine you know this already; just triple-checking.  I was using a stock
Sun 64-bit JVM for Windows.  I suppose that could be a difference; maybe
they don't provide a 64-bit JVM for Linux?  Or maybe you were using
somebody else's JVM?

Mark

PS: Reminder, if you reply to me, pls CC me.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Tue, Jan 6, 2009 at 5:41 AM, VishalS <vish...@rediff.co.in> wrote:

> Hi,
>
>
>
>  I am experimenting with a system with around 120 million documents. The
> index is split into sub-indices of ~10M documents - each such index is
> searched by a single machine. The results are aggregated using the
> DistributedSearcher client. I am seeing a lot of performance issues with
> the system - most of the time, the response times are >4 seconds, and in
> some cases they go up to a minute.
>
>
>
>  It would be wonderful to know if there are ways to optimize what I am
> doing, or if there is something obvious that I am doing wrong. Here's what
> I have tried so far, and the issues I see:
>
>
>
> 1.      Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
> CPUs running Linux. However, the searcher is not able to use more than 1 GB
> of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a
> Lucene issue. Is there a way we can have the searcher use more RAM to speed
> things up?
> 2.      The total size of the index directory on each machine is ~70-100 GB.
> The prx file is 50GB, the fnm and frq files are ~27GB each, and the fdt file
> is around 3GB. Is this too big?
> 3.      I have tried analyzing my documents for commonly occurring terms in
> various fields, and added these terms to common-terms.utf8. There are ~10K
> terms in this file for me now. I am hoping this will help me speed up any
> phrase queries I am doing internally (although there is a cost attached in
> terms of the number of unique terms in the Lucene index; the total index
> size has increased by ~10-15%, which I guess is ok.)
> 4.      There are around 8 fields that are searched for each of the words
> in the query. Also, a phrase query containing all the words is fired in
> each of these fields as well. This means that for a 3-word input query, my
> Lucene query contains 24 (3*8) term queries and 8 (1*8) 3-word phrase
> queries. Is this too long or too expensive?
> 5.      I have noticed that the slowest-running queries (it takes up to a
> minute sometimes) are often the ones that have one or more common words.
> 6.      Each individual searcher has a single Lucene indexlet. Would it be
> faster to have more than 1 indexlet on the machine?
> 7.      I am using a Tomcat 6.0 installation out-of-the-box, with some minor
> changes to the number of threads and the Java stack size allocation.
>
>
>
> If there's anyone else who has had experience working with large indices, I
> would love to get in touch and exchange notes.
>
>
>
> Regards,
>
>
>
> -Vishal.
>
>
