On a similar topic, has anybody measured query performance as a function of index size? Well, I did, and the results surprised me. I measured query throughput on 8 indexes ranging in size from 55,000 to 4.4 million documents. When plotted on a graph, there is a distinct hyperbolic curve (throughput roughly proportional to 1/x). I had expected to see more of a linear curve with a sharp drop-off at some point. Interesting
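For anyone who wants to repeat the experiment, a minimal sketch of such a throughput measurement with the Hits-based search API of that era is below. The index paths, field name, query term and run count are placeholders, not the ones actually used for the numbers above; in practice you would replay a realistic query mix.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ThroughputCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical index locations, smallest to largest (55k .. 4.4M docs).
            String[] indexDirs = { "/indexes/idx-55k", "/indexes/idx-4_4m" };
            // Hypothetical field and term.
            Query query = new TermQuery(new Term("body", "lucene"));

            for (String dir : indexDirs) {
                IndexSearcher searcher = new IndexSearcher(dir);
                int runs = 1000;
                long start = System.currentTimeMillis();
                for (int i = 0; i < runs; i++) {
                    Hits hits = searcher.search(query); // run the query
                    hits.length();                      // total match count, no docs fetched
                }
                long elapsed = System.currentTimeMillis() - start;
                double qps = runs * 1000.0 / elapsed;
                System.out.println(dir + ": " + qps + " queries/sec");
                searcher.close();
            }
        }
    }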
Peter

On 12/5/06, Zhang, Lisheng <[EMAIL PROTECTED]> wrote:
Hi Soeren,

Thanks very much for the explanations. Yes, there is no linear relation when searching for a keyword that appears in only a few docs.

Best regards, Lisheng

-----Original Message-----
From: Soeren Pekrul [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 05, 2006 10:37 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene search performance: linear?

Hello Lisheng,

a search process usually has to do two things.

First it has to find the term in the index. I don't know the implementation of term lookup in Lucene; I hope the index is at least a sorted list or a binary tree, so it can search binary. The time to find a term depends on the number of terms n_t. With binary search the complexity is approximately log(n_t), so this step should be better than linear.

Second it has to collect the documents for the term. This depends on the number of documents n_d containing the term: it has to go through the list of documents for the term, so the time should be proportional to the number of documents for that term, even if it doesn't calculate the similarity. Usually the number of documents for a single term is less than the total number of documents in the collection and less than the total number of terms in the index. If the number of documents for a single term is less than the total number of documents, then the search for a single term, including step one (finding the term) and step two (collecting the documents and calculating the score), should be better than linear in the number of documents.

> I indexed first 220,000 docs, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
>
> Then I indexed 440,000 docs, with the same keyword, queried it
> again and fetched a few docs, with Hits.length()=440,000.

In your case the query term is contained in all documents. The number of documents for the single term equals the total number of documents in your collection. The hit collector has to collect all documents, and the collecting process is proportional to the number of documents to collect. So the search over all documents should be at least linear in the total number of documents.

Sören

Zhang, Lisheng schrieb:
> Hi,
>
> I indexed first 220,000 docs, all with a special keyword, I did a simple
> query and only fetched 5 docs, with Hits.length()=220,000.
>
> Then I indexed 440,000 docs, with the same keyword, queried it
> again and fetched a few docs, with Hits.length()=440,000.
>
> I found that the search time is about linear: the 2nd query took about 2 times
> longer than the 1st query. I would like to understand:
>
> Does the linear relation come from the score calculation, since we
> have to calculate the score one by one? Or is there another reason?
>
> With a B-tree index I would naively expect better scalability?
>
> Thanks very much for your help,
>
> Lisheng
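To make Soeren's second step concrete: even when only a handful of stored documents are fetched through Hits, every matching document is still visited and scored once, which is why the time grows linearly when the term matches the whole collection. Below is a minimal sketch, assuming the Lucene 2.x Hits/HitCollector API discussed in this thread; the index path, field name and keyword are hypothetical.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class LinearCollectDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical index path; "special" is a keyword present in every document.
            IndexSearcher searcher = new IndexSearcher("/indexes/all-docs");
            Query query = new TermQuery(new Term("keyword", "special"));

            // Fetching only the first 5 stored documents through Hits...
            Hits hits = searcher.search(query);
            for (int i = 0; i < 5 && i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("id")); // "id" is a hypothetical stored field
            }

            // ...still means every matching document was collected and scored once.
            // A bare HitCollector makes that visit count explicit:
            final int[] visited = { 0 };
            searcher.search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    visited[0]++; // called once per matching document
                }
            });
            System.out.println("documents collected: " + visited[0]
                    + " (equals Hits.length() = " + hits.length() + ")");

            searcher.close();
        }
    }

If the keyword matched only a small fraction of the collection, the collect() loop would run correspondingly fewer times, which is the case where the overall query time stops looking linear in the total index size.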