On a similar topic, has anybody measured query performance as a function of
index size?
Well, I did and the results surprised me. I measured query throughput on 8
indexes that varied in size from 55,000 to 4.4 million documents. When
plotted on a graph, there is a distinct hyperbolic curve (1/x). I expected
to see more of a linear curve with a sharp drop-off at some point.
Interesting

Peter

On 12/5/06, Zhang, Lisheng <[EMAIL PROTECTED]> wrote:

Hi Soeren,

Thanks very much for explanations, yes, there
is no linear relation when searching a keyword
which is only in a few docs.

Best regards, Lisheng

-----Original Message-----
From: Soeren Pekrul [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 05, 2006 10:37 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene search performance: linear?


Hello Lisheng,

a search process has to do usually two thinks. First it has to find the
term in the index. I don't know the implementation of finding a term in
Lucene. I hope that the index is at least a sorted list or a binary
tree, so it can search binary. The time finding a term depends of the
term's number n_t. If it searches binary the complexity is approximately
log(n_t). The search time should be better then linear.

Second it has to collect the documents for a term. This depends of the
documents number n_d for a term. It has to go thru the list of documents
for a term. The time should be proportional to the number of documents
for a term even if it doesn't calculate the similarity. Usually the
number of documents for a single term is less than the total number of
documents in the collection and less than the total number of terms in
the index.

If the number of documents for a single term is less than the total
number of documents the search process for a single term including
process one (finding the term) and process two (collecting the documents
and calculating the score) should be better the linear to the number of
documents.

> I indexed first 220,000, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
>
> Then I indexed 440,000 docs, with the same keyword, query it
> again and fetched a few docs, with Hits.length(0=440,000.

In your case the query term is contained in all documents. The number of
documents for a single term is equals the total number of documents in
your collection. The hit collector has to collect all documents. The
collecting process is proportional to the number of documents to
collect. So the search for all documents should be at least linear to
the total number of documents.

Sören

Zhang, Lisheng schrieb:
> Hi,
>
> I indexed first 220,000, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
>
> Then I indexed 440,000 docs, with the same keyword, query it
> again and fetched a few docs, with Hits.length(0=440,000.
>
> I found that search time is about linear: 2nd time is about 2 times
> longer than 1st query. I would like to understand:
>
> Does the linear relation come from score calculation, since we
> have to calculate score one by one? Or other reason?
>
> If we have B-tree index I would naively expect a better scalibility?
>
> Thanks very much for your helps,
>
> Lisheng

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Reply via email to