Doug Cutting wrote:

Andrzej Bialecki wrote:

Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when for any query the response time is well below 1 second. Otherwise the service seems sluggish. Response times over 3 seconds are normally not acceptable.


It depends. Clearly an average response time of less than 1 second is better than an average of 3 seconds - no argument there; that makes for a more useful search engine. But a search engine with a 3-second average response time is still much better than no search engine at all. If an institution cannot afford to guarantee a 1-second average response time, should it then run no search engine at all? For low-traffic, non-commercial search engines, sluggishness is not a fatal flaw.

[...]

Yes, I fully agree with your arguments here - please accept my apologies if I came across as whining or complaining about that particular installation - quite the contrary, I think it's a unique and useful service.

My point was about how to improve Nutch's response time for large collections and for commercial settings, where the service you offer has to meet demanding requirements for maximum response time... the response times from this particular installation served only as an example of the issue.

There is a total of 8,435,793 pages in that index. Here's a short list of queries, the number of matching pages, and the average time (I made just a couple of tests, no stress-loading ;-) )

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds


These are some of the slowest terms in this index.

* hurricane katrina: 773,001 pages, 3.5 seconds (!)


This is not a very interesting query for this collection...


That's not the point - the point is that this is a valid query that users may enter, and the search engine should be prepared to return results within certain acceptable time limits.


More common terms and multi-term queries take even more time, so even for such a relatively small index (from the POV of the whole Internet!) the response time may stretch into several seconds (try "com").


How often do you search for "com"?


Ugh... Again, that's beside the point. It's a valid query, and a simple one at that, and the response time was awful.

Response times over several seconds would mean that users would say goodbye and never return... ;-)


So, tell me, where will these users then search for archived web content related to hurricane katrina? There is no other option. If this were a competitive commercial offering, then some sluggishness would indeed be unacceptable, and ~10M pages in a single index might be too many on today's processors. But for a unique, non-profit offering, I don't think it is unacceptable. Not optimal, but workable. Should the archive refuse to make this content searchable until it has faster or more machines, or until Nutch is faster? I don't think so.


I hope I made it clear that I wasn't complaining about this particular installation. I just used it to illustrate a problem that I also see in other installations, where the demands are much higher and much more difficult to meet.


If 10 million docs are too many for a single server to meet such a performance target, then the total number of servers required to handle Internet-wide collections of billions of pages explodes...
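(To put rough, purely illustrative numbers on it: at 10 million pages per search server, a 2-billion-page collection would already need on the order of 200 servers for a single copy of the index, before any replication for load or redundancy.)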

So, I think it's time to re-think the query structure and scoring mechanisms, in order to simplify the Lucene queries generated by Nutch - or to do some other tricks...


I think "other tricks" will be more fruitful. Lucene is pretty well optimized, and I don't think qualitative improvements can be had by simplifying the queries without substantially reducing their effectiveness.

The trick that I think would be most fruitful is something like what Torsten Suel describes in his paper titled "Optimized Query Execution in Large Search Engines with Global Page Ordering".

http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt

I believe all of the major search engines implement something like this, where heuristics are used to avoid searching the complete index. (We certainly did so at Excite.) The results are no longer guaranteed to always be the absolute highest-scoring, but in most cases are nearly identical.

Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers. Then a HitCollector could simply stop searching once a given number of hits has been found.
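For illustration, here is a minimal sketch of such a collector against Lucene's HitCollector API (the class and names below are hypothetical, not existing Nutch code, and it assumes the index has already been re-numbered so that highly-boosted documents come first):

import org.apache.lucene.search.HitCollector;

// Hypothetical sketch: a collector that aborts the search after maxHits
// hits. It only makes sense if the index is sorted by boost, so the first
// documents scanned are also the most highly-boosted ones.
public class EarlyTerminatingCollector extends HitCollector {

  // Thrown to bail out of the search once enough hits have been collected,
  // since the search loop itself offers no early-exit hook.
  public static class EnoughHits extends RuntimeException {}

  private final int maxHits;
  private int count = 0;

  public EarlyTerminatingCollector(int maxHits) {
    this.maxHits = maxHits;
  }

  public void collect(int doc, float score) {
    // ... record (doc, score) in a priority queue of top hits here ...
    if (++count >= maxHits) {
      throw new EnoughHits();
    }
  }
}

The caller would wrap the search in a try/catch, e.g. try { searcher.search(query, new EarlyTerminatingCollector(1000)); } catch (EarlyTerminatingCollector.EnoughHits e) { /* use the partial, near-identical results */ }. The hard part, of course, is the off-line re-numbering of documents by boost.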


Now we're talking... ;-) This sounds relatively simple and worth trying. Thanks for the pointers!

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

