Andrzej Bialecki wrote:
Hmm... Please define what "adequate" means. :-) IMHO, "adequate" means that for any query the response time is well below 1 second. Otherwise the service seems sluggish. Response times over 3 seconds are normally not acceptable.

It depends. Clearly an average response time of less than 1 second is better than an average of 3 seconds; there is no argument that it makes for a more useful search engine. But a search engine with a 3-second average response time is still much better than no search engine at all. If an institution cannot afford to guarantee a 1-second average response time, should it run no search engine at all? For low-traffic, non-commercial search engines, sluggishness is not a fatal fault.

There are a total of 8,435,793 pages in that index. Here's a short list of queries, the number of matching pages, and the average time (I ran just a couple of tests, no stress-loading ;-) ):

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds

These are some of the slowest terms in this index.

* hurricane katrina: 773,001 pages, 3.5 seconds (!)

This is not a very interesting query for this collection...

* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds

What I found out is that "usable" depends a lot on how you test it and what your minimum expectation is. There are some high-frequency terms (by which I mean terms with a frequency around 25%) that will consistently cause a dramatic slowdown. Multi-term queries, because of the way Nutch expands them into sloppy phrases, may take even more time, so even for such a relatively small index (from the POV of the whole Internet!) the response time may stretch to several seconds (try "com").

How often do you search for "com"?
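
The expansion cost is real, though. For a multi-term query, Nutch builds something roughly like the sketch below. This is only an illustration, not Nutch's actual query translation: the field names, boosts, and slop values are made up, and a Lucene 2.x-style API is assumed.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    // Illustrative sketch only, not Nutch's actual query translation.
    public class SloppyExpansionSketch {
      public static BooleanQuery expand(String[] terms) {
        String[] fields = { "content", "title", "anchor" };  // made-up fields
        float[] boosts  = { 1.0f,      2.0f,    2.0f };      // made-up boosts

        BooleanQuery query = new BooleanQuery();

        // each user term is required, but may match in any field
        for (int t = 0; t < terms.length; t++) {
          BooleanQuery perTerm = new BooleanQuery();
          for (int f = 0; f < fields.length; f++) {
            TermQuery tq = new TermQuery(new Term(fields[f], terms[t]));
            tq.setBoost(boosts[f]);
            perTerm.add(tq, BooleanClause.Occur.SHOULD);
          }
          query.add(perTerm, BooleanClause.Occur.MUST);
        }

        // plus an optional sloppy phrase per field; this is the part that
        // reads term positions for every candidate document and dominates
        // the cost of multi-term queries
        if (terms.length > 1) {
          for (int f = 0; f < fields.length; f++) {
            PhraseQuery phrase = new PhraseQuery();
            for (int t = 0; t < terms.length; t++) {
              phrase.add(new Term(fields[f], terms[t]));
            }
            phrase.setSlop(20);            // made-up slop
            phrase.setBoost(boosts[f]);
            query.add(phrase, BooleanClause.Occur.SHOULD);
          }
        }
        return query;
      }
    }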

Response times over several seconds would mean that users would say goodbye and never return... ;-)

So, tell me, where else will these users search for archived web content related to Hurricane Katrina? There is no other option. If this were a competitive commercial offering, then some sluggishness would indeed be unacceptable, and ~10M pages in a single index might be too many on today's processors. But for a unique, non-profit offering, I don't think it is unacceptable. Not optimal, but workable. Should the archive refuse to make this content searchable until they have faster or more machines, or until Nutch is faster? I don't think so.

If 10 million docs is too much for a single server to meet such a performance target, then this explodes the total number of servers required to handle Internet-wide collections of billions of pages...

So, I think it's time to re-think the query structure and scoring mechanisms, in order to simplify the Lucene queries generated by Nutch - or to do some other tricks...

I think "other tricks" will be more fruitful. Lucene is pretty well optimized, and I don't think qualitative improvements can be had by simplifying the queries without substantially reducing their effectiveness.

The trick that I think would be most fruitful is something like what Torsten Suel describes in his paper titled "Optimized Query Execution in Large Search Engines with Global Page Ordering".

http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt

I believe all of the major search engines implement something like this, where heuristics are used to avoid searching the complete index. (We certainly did so at Excite.) The results are no longer guaranteed to always be the absolute highest-scoring, but in most cases are nearly identical.

Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents get low document numbers. Then a HitCollector can simply stop searching once a given number of hits has been found.
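
A minimal sketch of such a collector, assuming the index has already been re-sorted so that low document numbers correspond to high-boost pages. The class names and the cutoff constant below are made up, and the Lucene HitCollector API (Searcher.search(Query, HitCollector)) is assumed; since HitCollector has no built-in way to stop a search, the sketch bails out with an unchecked exception.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Illustrative sketch only: collect hits in doc-id order (i.e. boost
    // order, given the re-sorted index) and abort after MAX_HITS matches.
    public class EarlyTerminatingSearch {

      private static final int MAX_HITS = 1000;      // made-up cutoff

      private static class StopException extends RuntimeException {}

      /** Returns a list of float[]{docId, score}, best score first. */
      public static List search(IndexReader reader, Query query)
          throws IOException {
        final List hits = new ArrayList();
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
          searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
              hits.add(new float[] { doc, score });
              if (hits.size() >= MAX_HITS) {
                throw new StopException();           // stop the scan early
              }
            }
          });
        } catch (StopException e) {
          // expected: we have seen enough of the highest-boosted documents
        }
        // rank the collected candidates by score as usual
        Collections.sort(hits, new Comparator() {
          public int compare(Object a, Object b) {
            return Float.compare(((float[]) b)[1], ((float[]) a)[1]);
          }
        });
        return hits;
      }
    }

The results would not be guaranteed to be the global top hits, but with boosts derived from a global page ordering they should usually be very close, which is exactly the trade-off described in the paper.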

Doug

