Angel,
Have you stepped through this with a debugger?  That might reveal something.
Have you tried doing kill -QUIT <PID> while waiting for those slow calls you 
mention to return?  Perhaps this will show that the slow calls spend their time 
somewhere where the faster calls never go.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Angel Faus <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, April 8, 2008 8:15:11 PM
Subject: FieldCache performance

Hi,

I have a puzzling question about FieldCache; we use FieldCache to
speed up faceted search [just like solr does] and also to avoid
creating Document instances when listing results. It works great, but
for one field where performance is awful: on the order of 15-30
seconds for one index, and can grow to hundreds of seconds when using
MultiReader.

There are two things that surprise me:

  1. -  calling FieldCache.DEFAULT.getInts takes a lot longer on a
certain field ("vid") than on all the others.
 2. - when querying different indexes, the time is not proportional to
the number of documents, or to the number of distinct values of this
field

The code involved is quite simple:

        cache_fuente =
org.apache.lucene.search.FieldCache.DEFAULT.getInts(reader, "fuente");
        cache_vid =
org.apache.lucene.search.FieldCache.DEFAULT.getInts(reader, "vid");

So, for example, when querying our index INDEX-2 (1,634,330 documents)
computing cache_fuente takes 0.10 seconds, computing cache_vid takes
6.60 sec. The difference is a 67x factor.

Both fields are integers, "vid" is larger but not that much [on
average 7.9 digits vs 3.5 digits for "fuente"] and both are stored as
non-tokenized fields [Keywords].

It turns out that INDEX-2 is the best case: when querying INDEX-6
[963,863 documents] the time it takes to compute cache_vid jumps to
16.85 seconds. Other indices take 30,49 seconds, 7,32 seconds, 45.12
seconds, etc.

This variation between indices is not related to the number of
documents nor to the number of distinct values ["vid" is always a
unique field]. I can provide detalied data if it's useful.

And now to the second puzzling point, when querying a MultiReader of
INDEX-2 & INDEX-6 together, the time to compute "cache_vid" jumps to
61.9 seconds. So, it is not additive (remember that separately this
indices take 16.86 + 6.60 = 23.45 seconds)

I've read the source code searching for a clue but with no success. I
would really appreciate if anyone could help me understand what drives
FieldCache performance.

"top" shows that all this time it takes to build cache_vid is spent
using CPU cycles, so I thought it could be parsing-related, but
switching to FieldCache.DEFAULT.getStrings does not help at all.

Finally: we are using lucene-core-2.1.0 for querying, and the search
is served through a servlet in a tomcat process. The index is created
using lucene-1.4.3; i don't know if the difference in versions can
matter.

Thanks in advance for any help,


Angel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to