. . wrote:
Are there any plans to cache search result pages like, say, Gigablast does? This would speed up the engine so much. I feel at mozdex I have to wait some time for the results to come back.
I've never found that caching search results helps much.
First, the data used to resolve and display frequently-queried terms and frequently-returned documents will be cached by the filesystem, so queries using these will not perform disk i/o. This index data is compressed, making the filesystem's cache a very efficient use of RAM.
Second, unlike terms, complete queries do not repeat themselves frequently enough that a large cache of results seems to help overall performance. I don't recall the exact numbers, but, when we computed them at Excite, we found that caching the top hits of thousands of queries would result in a cache hit rate of less than 10%. That is not much reward for the amount of memory this would consume.
If someone has a large query log then they can evaluate this for themselves. How many queries does it take to account for, e.g., 40% of queries overall? One must be careful not to "overfit" here. As a methodology, you might chop your log in two, then take the most frequent queries in the first half, and find out what percentage of the second half of the log they account for.
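The split-log methodology above can be sketched in a few lines. This is only an illustration: the function name, the toy log, and the assumption that queries are already normalized strings in arrival order are mine, not part of any real log format.

```python
from collections import Counter

def holdout_hit_rate(queries, cache_size):
    """Estimate a result cache's hit rate without overfitting.

    The cache is filled from the first half of the log only, then
    scored against the second half, which it has never seen.
    """
    mid = len(queries) // 2
    train, test = queries[:mid], queries[mid:]
    if not test:
        return 0.0
    # The cache holds the most frequent queries of the first half.
    cached = {q for q, _ in Counter(train).most_common(cache_size)}
    hits = sum(1 for q in test if q in cached)
    return hits / len(test)

# Tiny made-up log, one query per entry, in arrival order.
log = ["nutch", "lucene", "nutch", "java", "nutch", "mp3", "lucene", "weather"]
print(holdout_hit_rate(log, cache_size=1))  # 0.25
```

Running this over a real log for a range of cache sizes would give the curve needed to answer the "how many queries account for 40%" question.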
AltaVista, via Raymie Stata, donated a sampling of their query logs from a few years back for research purposes to the Internet Archive. For those interested, they're available via FTP:

  ftp ftp.archive.org
  login anonymous
  cd pub/AVLogs
I don't know anything more about their origin and potential uses than is in the README.
When we first received them, I did run a quick analysis, and it looked like caching could help a lot.
I found that:
Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1% of most-common queries accounted for 2,321,141 requests (32%), and the top 10% of most-common queries accounted for 4,288,312 requests (60%).
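For anyone wanting to repeat this kind of count on another log, here is a rough sketch. The helper name and its rounding choice (at least one query in the "top" set) are my own assumptions; the last lines just re-derive the percentages from the raw counts quoted above.

```python
from collections import Counter

def top_fraction_share(queries, fraction):
    """Share of all requests covered by the top `fraction` of unique queries."""
    if not queries:
        return 0.0
    counts = Counter(queries)
    # Number of distinct queries in the top slice (at least one).
    k = max(1, int(len(counts) * fraction))
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(queries)

# The quoted percentages follow directly from the raw counts.
total = 7_175_648
print(round(100 * 2_321_141 / total))  # top 1% of unique queries  -> 32
print(round(100 * 4_288_312 / total))  # top 10% of unique queries -> 60
```

Note that this measures coverage on the same log the top queries were drawn from, so it is an upper bound compared with the split-log approach.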
Doug rightly points out that automatic disk caching transparently gets you a lot of the benefit, but that doesn't scale up efficiently across multiple query-servers.
In such a situation, each front-end query-server could specialize in caching a deterministic part of the most common results -- using a distributed hashtable (DHT) technique -- and get results from each other's RAM caches over a quick network rather than from slow disk or insufficiently warm local caches. See memcached as an example implementation:
http://www.danga.com/memcached/
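As a sketch of the "deterministic part" idea: every front-end can compute, from the query alone, which server should hold its cached results. The example below uses rendezvous (highest-random-weight) hashing, which is my choice for illustration and not something memcached itself prescribes; the server names are hypothetical.

```python
import hashlib

def normalize(query):
    # Assumption: case and extra whitespace do not change the results.
    return " ".join(query.lower().split())

def owner_for(query, servers):
    """Pick the server responsible for caching this query.

    Each (server, query) pair gets a pseudo-random score; the highest
    score wins. Every front-end computes the same answer independently.
    """
    q = normalize(query)

    def score(server):
        digest = hashlib.sha256(f"{server}|{q}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)

servers = ["qs1:11211", "qs2:11211", "qs3:11211"]  # hypothetical hosts
owner = owner_for("Britney  Spears", servers)
print(owner)
```

A property of this scheme is that removing a server only moves the queries that server owned; everything else stays where it was, so the other caches remain warm.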
-- This also gives me a few "LazyWeb" ideas I'll throw out to the Nutch community:
People always want to P2P-ify Nutch, typically on the crawling side, but then Doug rightly points out that compared to actually serving searches, crawling isn't that big of a deal. If queries repeat often enough to justify caching, the P2P win for Nutch could be a volunteer P2P caching net of results, massively lowering traffic against the central search indices.
Here are several levels of how it might work...
Alpha: Gossipy
Volunteers access a Nutch-powered search engine via a special client -- probably a local HTTP proxy. This proxy grabs, say, the top 100 results for the person's queries and remembers them for a predictable amount of time.
The Nutch server remembers who was recently handed which result-sets, and when a repeat query comes in from a different client, that client is redirected to the last peer to have made, and likely to still be caching, that query.
In this way, the Nutch server doesn't even need to keep the 100 results in memory: just the (hash of) the query and the address of peers likely to have the result.
(The result set could be cryptographically signed to prevent tampering in the field.)
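The server-side bookkeeping for this could be as small as the sketch below. All names are hypothetical, signing is left out, and it assumes a redirected client keeps the results it receives for a full TTL, which the description above does not settle.

```python
import hashlib
import time

class PeerDirectory:
    """Maps a query hash to the peer that most recently asked for it."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # query hash -> (peer address, time recorded)

    @staticmethod
    def key(query):
        # Only the hash is stored, never the query text or the results.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def lookup(self, query, requester):
        """Return a peer to redirect to, or None if the server must answer."""
        k = self.key(query)
        now = self.clock()
        redirect = None
        entry = self.entries.get(k)
        if entry is not None:
            peer, recorded = entry
            if now - recorded < self.ttl and peer != requester:
                redirect = peer
        # Assumption: the requester now caches the results for a full TTL.
        self.entries[k] = (requester, now)
        return redirect
```

The first client to ask gets `None` and is served centrally; a second client asking within the TTL is pointed at the first, and so on down the chain.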
Every request does still hit the Nutch server once, though -- which is nice for gathering stats but still costly, which leads to...
Beta: DHT-rrific
The special volunteer clients form a DHT which lets them know which peers are already most likely to contain the query in progress. They are contacted first, resulting in hits to the central search server only if they don't already have the desired query. The central server never sees queries that are served out of the P2P cache, and doesn't need to remember where certain result-sets are cached -- though it still might serve a role in helping to bootstrap the P2P DHT and manage entries/exits.
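A toy version of that lookup order, peers first and central server only on a miss, might look like this. It uses plain hash-modulo placement over a fixed peer list, which is a simplification on my part: a real DHT would also route between peers and cope with joins and exits. The class and function names are made up.

```python
import hashlib

class Peer:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # query -> result list

def responsible_peer(query, peers):
    # Simplification: fixed peer list, placement by hash modulo.
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]

def search(query, peers, central_search):
    """Try the responsible peer first; fall back to the central server."""
    peer = responsible_peer(query, peers)
    if query in peer.cache:
        return peer.cache[query], "p2p"
    results = central_search(query)  # only cache misses reach the server
    peer.cache[query] = results
    return results, "central"
```

With this shape the central server never sees a query that the P2P cache can answer, at the cost of losing those queries from its statistics.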
However, in this scenario, you're leaking potentially private info about your queries to random other volunteers. Which leads to...
Gamma-1: Crypty
The central server actually encrypts the result sets with a deterministic key based on the query, so even the caches don't prima facie know what they're caching, but the queryers still know deterministically which caches to check and how to decrypt the results.
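To make the key derivation concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the published `seed` (which could be rotated over time), and above all the cipher, which is a toy XOR keystream and not safe for real use. An actual deployment would want a vetted authenticated cipher.

```python
import hashlib
import hmac

def derive(query, seed, purpose):
    """Derive 32 bytes from the query; `seed` is a hypothetical public value."""
    message = purpose + b"|" + query.encode("utf-8")
    return hmac.new(seed, message, hashlib.sha256).digest()

def cache_id(query, seed):
    # Tells a client which cache entry to ask for without naming the query.
    return derive(query, seed, b"locate").hex()

def _keystream_xor(key, data):
    # TOY cipher, illustration only: XOR with a SHAKE-256 keystream.
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt_results(query, seed, results):
    return _keystream_xor(derive(query, seed, b"encrypt"), results)

# XOR with the same keystream undoes itself.
decrypt_results = encrypt_results
```

A cache stores the ciphertext under `cache_id(...)` and cannot read it unless it already knows, or guesses, the query; anyone who knows the query can both find and decrypt the entry.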
A patient attacker could over time build a dictionary of common queries, but the ability of such a dictionary to compromise user privacy could be limited by varying hash seeds regularly or other techniques, including...
Gamma-2: Mixy
The same machines doing caching also offer anonymous proxying so the cache-locations can't easily deduce the ultimate identities of the users requesting their result sets.
-- The client-side volunteer caching P2P extension could probably even be a Javascript/XUL one-click-install Mozilla extension...
- Gordon @ IA
