Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Andrzej Bialecki Fri, 09 Dec 2005 01:45:08 -0800

Hi,

I made an experiment with Google, to see if they use a similar approach.

I find the results to be most interesting. I selected a query which isguaranteed to give large result sets, but is more complicated than asingle term query: http com.

The total number of hits (approx) is 2,780,000,000. BTW, I find itcurious that the last 3 or 6 digits always seem to be zeros ... there'ssome clever guesstimation involved here. The fact that Google Suggest isable to return results so quickly would support this suspicion.

When I ran the query for the first time, the response time was 0.29 sec.All subsequent queries retrieving the first 10 results are in the orderof 0.07 sec.

This is for retrieving just the first page (first 10 results).Retrieving results 10-20 also takes 0.08 sec, which suggests that thisresult was cached somewhere. Starting from results 20+ the response timeincreases (linearly?), although it varies wildly between requests,sometimes returning quicker, sometimes taking the max time - whichsuggests that I'm hitting different servers each time. Also, if I wait~30 sec to 1 minute, the response times are back to the values for thefirst-time run.


start   first     repeated response
30      0.14      0.08-0.21
50      0.29      0.11-0.22
100     0.36      0.22-0.45
200     0.73      0.49-0.65
300     0.96      0.64-0.98
500     1.36      1.43-1.87
650     2.24      1.49-1.85

The last range was the maximum in this case - Google wouldn't displayany hit above 652 (which I find curious, too - because the total numberof hits is, well, significantly higher - and Google claims to return upto the first 1000 results).

My impressions from this excercise are perhaps not so surprising: Googleis highly optimized for retrieving the first couple of results, and themore results you want to retrieve the worse the performance. Finally,you won't be able to retrieve any results above a couple hundred, quiteoften less than the claimed 1000 results threshold.

As for the exact techniques of this optimization, we'll never know forsure, but it seems like something similar is going on to what yououtlined in your email. I think it would be great to try it out.


Andrzej


Doug Cutting wrote:

Doug Cutting wrote:
Implementing something like this for Lucene would not be toodifficult. The index would need to be re-sorted by document boost:documents would be re-numbered so that highly-boosted documents hadlow document numbers.
In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(i) - boost(j);
3. Implement a FilterIndexReader that re-numbers using the sortedarray. So, for example, the document numbers in the TermPositionswill a[old.doc()]. Each term's positions will need to be readentirely into memory and sorted to perform this renumbering.
The IndexOptimizer.java class in the searcher package was an oldattempt to create something like what Suel calls "fancy postings". Itcreates an index with the top 10% scoring postings. Since documentsare not renumbered one can intermix postings from this with the fullindex. So for example, one can first try searching using this indexfor terms that occur more than, e.g., 10k times, and use the fullindex for rarer words. If that does not find 1000 hits then the fullindex must be searched. Such an approach can be combined with using apre-sorted index.
I think the first thing to implement would be to implement somethinglike what Suel calls first-1000. Then we need to evaluate this anddetermine, for query log, how different the results are.
Then a HitCollector can simply stop searching once a given number ofhits are found.
Doug



--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Reply via email to