Andrzej Bialecki wrote:
Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when
for any query the response time is well below 1 second. Otherwise the
service seems sluggish. Response times over 3 seconds are normally not
acceptable.
It depends. Clearly an average response time of less than 1 second is
better than an average response time of 3 seconds. There is no
argument. That is a more useful search engine. But a search engine
with a 3-second average response time is still much better than no
search engine at all. If an institution cannot afford to guarantee a
one-second average response time, must it forgo running a search engine
altogether? For low-traffic, non-commercial search engines, sluggishness
is not a fatal fault.
There is a total of 8,435,793 pages in that index. Here's a short list
of queries, the number of matching pages, and the average time (I made
just a couple of tests, no stress-loading ;-) )
* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
These are some of the slowest terms in this index.
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
This is not a very interesting query for this collection...
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds
What I found out is that "usable" depends a lot on how you test it and
what your minimum expectation is. There are some high-frequency terms
(by which I mean terms that occur in roughly 25% of all documents) that
will consistently cause a dramatic slowdown. Multi-term queries, because
of the way Nutch expands them into sloppy phrases, may take even more
time, so even for such a relatively small index (from the POV of the
whole Internet!) the response time may drag into several seconds (try
"com").
How often do you search for "com"?
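To make the cost concrete, here is roughly the shape of query that a
two-term Nutch query turns into, sketched with the Lucene API. This is
a simplification: the real Nutch query filters also search the url,
anchor and title fields, and the slop and boost values below are
illustrative, not Nutch's actual settings.

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class SloppyPhraseSketch {
    /** Expand two terms into required clauses plus an optional
        sloppy phrase.  Slop and boost are illustrative only. */
    public static Query expand(String field, String t1, String t2) {
      BooleanQuery q = new BooleanQuery();
      // both terms are required to match...
      q.add(new TermQuery(new Term(field, t1)), true, false);
      q.add(new TermQuery(new Term(field, t2)), true, false);
      // ...and an optional proximity clause rewards documents where
      // they occur near each other; evaluating this clause against
      // position data is where the extra time goes
      PhraseQuery phrase = new PhraseQuery();
      phrase.add(new Term(field, t1));
      phrase.add(new Term(field, t2));
      phrase.setSlop(10);            // illustrative slop
      phrase.setBoost(2.0f);         // illustrative boost
      q.add(phrase, false, false);   // optional, not required
      return q;
    }
  }

The term clauses alone are cheap; it is the sloppy phrase, which must
read term positions for every candidate document, that makes popular
terms so expensive.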
Response times over several seconds would mean that users
would say goodbye and never return... ;-)
So, tell me, where will these users then search for archived web content
related to Hurricane Katrina? There is no other option. If this were a
competitive commercial offering, then some sluggishness would indeed be
unacceptable, and ~10M pages in a single index might be too many on
today's processors. But for a unique, non-profit offering, I don't think
this is unacceptable. Not optimal, but workable. Should the archive
refuse to make this content searchable until it has faster or more
machines, or until Nutch is faster? I don't think so.
If 10 million docs is too much for a single server to meet such a
performance target, then this explodes the total number of servers
required to handle Internet-wide collections of billions of pages...
So, I think it's time to re-think the query structure and scoring
mechanisms, in order to simplify the Lucene queries generated by Nutch -
or to do some other tricks...
I think "other tricks" will be more fruitful. Lucene is pretty well
optimized, and I don't think qualitative improvements can be had by
simplifying the queries without substantially reducing their effectiveness.
The trick that I think would be most fruitful is something like what
Torsten Suel describes in his paper titled "Optimized Query Execution in
Large Search Engines with Global Page Ordering".
http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt
I believe all of the major search engines implement something like this,
where heuristics are used to avoid searching the complete index. (We
certainly did so at Excite.) The results are no longer guaranteed to
always be the absolute highest-scoring, but in most cases are nearly
identical.
Implementing something like this for Lucene would not be too difficult.
The index would need to be re-sorted by document boost: documents
would be re-numbered so that highly-boosted documents had low document
numbers. Then a HitCollector can simply stop searching once a given
number of hits has been found.
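Sketched against the current HitCollector API, the collector might look
like this (the class name and the exception trick for stopping the scan
are just one way to do it):

  import org.apache.lucene.search.HitCollector;

  // Illustrative sketch.  Assumes the index has been re-sorted so
  // that document numbers ascend as boost descends, i.e. the first
  // matches scanned are also the highest-boosted ones.
  public class EarlyTerminatingCollector extends HitCollector {

    /** Thrown to abort the scan once enough hits are collected. */
    public static class EnoughHits extends RuntimeException {}

    private final int maxHits;
    private int count = 0;

    public EarlyTerminatingCollector(int maxHits) {
      this.maxHits = maxHits;
    }

    public void collect(int doc, float score) {
      // record (doc, score) here, e.g. in a priority queue, then
      // bail out once we have seen enough matches
      if (++count >= maxHits)
        throw new EnoughHits();
    }
  }

The caller wraps the search in a try/catch and treats EnoughHits as
success:

  HitCollector collector = new EarlyTerminatingCollector(1000);
  try {
    searcher.search(query, collector);
  } catch (EarlyTerminatingCollector.EnoughHits e) {
    // expected: we scanned the first 1000 matches in boost order
  }

This is what makes the results approximate: a low-boost document with a
high query score is simply never examined.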
Doug