Doug Cutting wrote:

Andrzej Bialecki wrote:

Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when for any query the response time is well below 1 second. Otherwise the service seems sluggish. Response times over 3 seconds are normally not acceptable.


It depends. Clearly an average response time of less than 1 second is better than an average of 3 seconds - no argument there; that makes for a more useful search engine. But a search engine with a 3-second average response time is still much better than no search engine at all. If an institution cannot afford to guarantee a 1-second average response time, should it then run no search engine at all? For low-traffic, non-commercial search engines, sluggishness is not a fatal flaw.

[...]

Yes, I fully agree with your arguments here - please accept my apologies if I came across as whining or complaining about that particular installation - quite the contrary, I think it's a unique and useful service.

My point was about how to improve Nutch's response time for large collections and for commercial settings, where the service you offer has to meet demanding requirements for maximum response time... the response times from this particular installation served only as an example of the issue.

There is a total of 8,435,793 pages in that index. Here's a short list of queries, the number of matching pages, and the average time (I made just a couple of tests, no stress-loading ;-) )

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds


These are some of the slowest terms in this index.

* hurricane katrina: 773,001 pages, 3.5 seconds (!)


This is not a very interesting query for this collection...


That's not the point - the point is that this is a valid query that users may enter, and the search engine should be prepared to return results within certain acceptable time limits.


More common terms and multi-term queries take even more time, so even for such a relatively small index (from the POV of the whole Internet!) the response time may stretch into several seconds (try "com").


How often do you search for "com"?


Ugh... Again, that's beside the point. It's a valid query, and a simple one at that, and the response time was awful.

Response times over several seconds would mean that users would say goodbye and never return... ;-)


So, tell me, where will these users then search for archived web content related to hurricane katrina? There is no other option. If this were a competitive commercial offering, then some sluggishness would indeed be unacceptable, and ~10M pages in a single index might be too many on today's processors. But for a unique, non-profit offering, I don't think it is unacceptable. Not optimal, but workable. Should the archive refuse to make this content searchable until it has faster or more machines, or until Nutch is faster? I don't think so.


I hope I made it clear that I wasn't complaining about this particular installation. I just used it to illustrate a problem that I also see in other installations, where the demands are much higher and much more difficult to meet.


If 10 million docs are too many for a single server to meet such a performance target, then the total number of servers required to handle Internet-wide collections of billions of pages explodes...
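(To put rough, purely illustrative numbers on it: at 10 million pages per search server, a 2-billion-page collection would already need on the order of 200 servers for a single copy of the index, before any replication for load or redundancy.)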

So, I think it's time to re-think the query structure and scoring mechanisms, in order to simplify the Lucene queries generated by Nutch - or to do some other tricks...


I think "other tricks" will be more fruitful. Lucene is pretty well optimized, and I don't think qualitative improvements can be had by simplifying the queries without substantially reducing their effectiveness.

The trick that I think would be most fruitful is something like what Torsten Suel describes in his paper titled "Optimized Query Execution in Large Search Engines with Global Page Ordering".

http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt

I believe all of the major search engines implement something like this, where heuristics are used to avoid searching the complete index. (We certainly did so at Excite.) The results are no longer guaranteed to always be the absolute highest-scoring, but in most cases are nearly identical.

Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers. Then a HitCollector could simply stop searching once a given number of hits has been found.
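For illustration, here is a minimal sketch of such a collector against Lucene's HitCollector API (the class and names below are hypothetical, not existing Nutch code, and it assumes the index has already been re-numbered so that highly-boosted documents come first):

import org.apache.lucene.search.HitCollector;

// Hypothetical sketch: a collector that aborts the search after maxHits
// hits. It only makes sense if the index is sorted by boost, so the first
// documents scanned are also the most highly-boosted ones.
public class EarlyTerminatingCollector extends HitCollector {

  // Thrown to bail out of the search once enough hits have been collected,
  // since the search loop itself offers no early-exit hook.
  public static class EnoughHits extends RuntimeException {}

  private final int maxHits;
  private int count = 0;

  public EarlyTerminatingCollector(int maxHits) {
    this.maxHits = maxHits;
  }

  public void collect(int doc, float score) {
    // ... record (doc, score) in a priority queue of top hits here ...
    if (++count >= maxHits) {
      throw new EnoughHits();
    }
  }
}

The caller would wrap the search in a try/catch, e.g. try { searcher.search(query, new EarlyTerminatingCollector(1000)); } catch (EarlyTerminatingCollector.EnoughHits e) { /* use the partial, near-identical results */ }. The hard part, of course, is the off-line re-numbering of documents by boost.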


Now we're talking... ;-) This sounds relatively simple and worth trying. Thanks for the pointers!

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

