Andrzej Bialecki wrote:
Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when
for any query the response time is well below 1 second. Otherwise the
service seems sluggish. Response times over 3 seconds are normally not
acceptable.
It depends. Clearly an average response time of less than 1 second is
better than an average response time of 3 seconds. There is no
argument. That is a more useful search engine. But a search engine
with a 3-second average response time is still much better than no
search engine at all. If an institution cannot afford to guarantee a
one-second average response time, must it forgo running a search engine
altogether? For low-traffic, non-commercial search engines, sluggishness
is not a fatal fault.
There is a total of 8,435,793 pages in that index. Here's a short list
of queries, the number of matching pages, and the average time (I made
just a couple of tests, no stress-loading ;-) )
* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
These are some of the slowest terms in this index.
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
This is not a very interesting query for this collection...
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds
What I found out is that "usable" depends a lot on how you test it and
what your minimum expectation is. There are some high-frequency terms
(by which I mean terms that occur in roughly 25% of all documents) that
will consistently cause a dramatic slowdown. Multi-term queries, because
of the way Nutch expands them into sloppy phrases, may take even more
time, so even for such a relatively small index (from the POV of the
whole Internet!) the response time may drag into several seconds (try
"com").
How often do you search for "com"?
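To make the cost concrete, here is roughly the shape of query that a
two-term Nutch query turns into, sketched with the Lucene API. This is
a simplification: the real Nutch query filters also search the url,
anchor and title fields, and the slop and boost values below are
illustrative, not Nutch's actual settings.

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class SloppyPhraseSketch {
    /** Expand two terms into required clauses plus an optional
        sloppy phrase.  Slop and boost are illustrative only. */
    public static Query expand(String field, String t1, String t2) {
      BooleanQuery q = new BooleanQuery();
      // both terms are required to match...
      q.add(new TermQuery(new Term(field, t1)), true, false);
      q.add(new TermQuery(new Term(field, t2)), true, false);
      // ...and an optional proximity clause rewards documents where
      // they occur near each other; evaluating this clause against
      // position data is where the extra time goes
      PhraseQuery phrase = new PhraseQuery();
      phrase.add(new Term(field, t1));
      phrase.add(new Term(field, t2));
      phrase.setSlop(10);            // illustrative slop
      phrase.setBoost(2.0f);         // illustrative boost
      q.add(phrase, false, false);   // optional, not required
      return q;
    }
  }

The term clauses alone are cheap; it is the sloppy phrase, which must
read term positions for every candidate document, that makes popular
terms so expensive.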
Response times over several seconds would mean that users
would say goodbye and never return... ;-)
So, tell me, where will these users then search for archived web content
related to Hurricane Katrina? There is no other option. If this were a
competitive commercial offering, then some sluggishness would indeed be
unacceptable, and ~10M pages in a single index might be too many on
today's processors. But for a unique, non-profit offering, I don't think
this is unacceptable. Not optimal, but workable. Should the archive
refuse to make this content searchable until it has faster or more
machines, or until Nutch is faster? I don't think so.
If 10 million docs is too much for a single server to meet such a
performance target, then this explodes the total number of servers
required to handle Internet-wide collections of billions of pages...
So, I think it's time to re-think the query structure and scoring
mechanisms, in order to simplify the Lucene queries generated by Nutch -
or to do some other tricks...
I think "other tricks" will be more fruitful. Lucene is pretty well
optimized, and I don't think qualitative improvements can be had by
simplifying the queries without substantially reducing their effectiveness.
The trick that I think would be most fruitful is something like what
Torsten Suel describes in his paper titled "Optimized Query Execution in
Large Search Engines with Global Page Ordering".
http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt
I believe all of the major search engines implement something like this,
where heuristics are used to avoid searching the complete index. (We
certainly did so at Excite.) The results are no longer guaranteed to
always be the absolute highest-scoring, but in most cases are nearly
identical.
Implementing something like this for Lucene would not be too difficult.
The index would need to be re-sorted by document boost: documents
would be re-numbered so that highly-boosted documents had low document
numbers. Then a HitCollector can simply stop searching once a given
number of hits has been found.
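Sketched against the current HitCollector API, the collector might look
like this (the class name and the exception trick for stopping the scan
are just one way to do it):

  import org.apache.lucene.search.HitCollector;

  // Illustrative sketch.  Assumes the index has been re-sorted so
  // that document numbers ascend as boost descends, i.e. the first
  // matches scanned are also the highest-boosted ones.
  public class EarlyTerminatingCollector extends HitCollector {

    /** Thrown to abort the scan once enough hits are collected. */
    public static class EnoughHits extends RuntimeException {}

    private final int maxHits;
    private int count = 0;

    public EarlyTerminatingCollector(int maxHits) {
      this.maxHits = maxHits;
    }

    public void collect(int doc, float score) {
      // record (doc, score) here, e.g. in a priority queue, then
      // bail out once we have seen enough matches
      if (++count >= maxHits)
        throw new EnoughHits();
    }
  }

The caller wraps the search in a try/catch and treats EnoughHits as
success:

  HitCollector collector = new EarlyTerminatingCollector(1000);
  try {
    searcher.search(query, collector);
  } catch (EarlyTerminatingCollector.EnoughHits e) {
    // expected: we scanned the first 1000 matches in boost order
  }

This is what makes the results approximate: a low-boost document with a
high query score is simply never examined.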
Doug