Re: [Nutch-dev] cache results (+Nutch-P2P)

Gordon Mohr (Internet Archive) Wed, 26 May 2004 15:12:12 -0700

Doug Cutting wrote:

. . wrote:
Are there any plans to cache search result pages like, say, gigablast does? This would speed up the engine so much. I feel at mozdex I have to wait some time for the results to come back.

I've never found that caching search results helps much.
First, the data used to resolve and display frequently-queried terms and frequently-returned documents will be cached by the filesystem, so queries using these will not perform disk i/o. This index data is compressed, making the filesystem's cache a very efficient use of RAM.

Second, unlike terms, complete queries do not repeat themselves frequently enough that a large cache of results seems to help overall performance. I don't recall the exact numbers, but, when we computed them at Excite, we found that caching the top hits of thousands of queries would result in a cache hit rate of less than 10%. That is not much reward for the amount of memory this would consume.

If someone has a large query log then they can evaluate this for themselves. How many queries does it take to account for, e.g., 40% of queries overall? One must be careful not to "overfit" here. As a methodology, you might chop your log in two, then take the most frequent queries in the first half, and find out what percentage of the second half of the log they account for.
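A rough sketch of that split-half check, in Python. The function name and the toy log are made up for illustration; a real run would read one query per request from an actual log and would probably normalize the queries first.

```python
from collections import Counter

def split_half_hit_rate(queries, cache_size):
    """Fraction of second-half requests that a cache holding the
    cache_size most frequent first-half queries would answer."""
    mid = len(queries) // 2
    train, test = queries[:mid], queries[mid:]
    if not test:
        return 0.0
    # "Train" on the first half only, so the cache is not fitted to
    # the same requests it is scored against.
    cached = {q for q, _ in Counter(train).most_common(cache_size)}
    hits = sum(1 for q in test if q in cached)
    return hits / len(test)

# Toy log, invented for illustration only.
log = ["nutch", "lucene", "nutch", "mp3",
       "nutch", "weather", "lucene", "nutch"]
print(split_half_hit_rate(log, cache_size=1))  # 0.5
```

Sweeping cache_size over a range gives the hit-rate curve, which is what shows whether the memory spent on the cache is worth it.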


AltaVista, via Raymie Stata, donated a sampling of their query logs
from a few years back for research purposes to the Internet Archive.
For those interested, they're available via FTP:

  ftp ftp.archive.org
  login anonymous
  cd pub/AVLogs

I don't know anything more about their origin and potential uses than
is in the README.

When we first received them, I did run a quick analysis, and it looked
like caching could help a lot.

I found that:

Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1%
of most-common queries accounted for 2,321,141 requests (32%), and the
top 10% of most-common queries accounted for 4,288,312 requests (60%).
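For anyone who wants to redo this kind of count on their own log, here is a minimal Python sketch. The helper name is hypothetical and this is not the script used for the numbers above; the last two lines only re-derive the quoted percentages from the quoted request counts.

```python
from collections import Counter
import math

def top_fraction_share(queries, fraction):
    """Share of all requests covered by the most common `fraction`
    of unique queries (e.g. fraction=0.01 for the top 1%)."""
    if not queries:
        return 0.0
    counts = Counter(queries)
    k = max(1, math.ceil(len(counts) * fraction))
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(queries)

# Re-deriving the percentages from the request counts quoted above.
total = 7_175_648
print(round(100 * 2_321_141 / total))  # 32
print(round(100 * 4_288_312 / total))  # 60
```

Note that this measures coverage on the same log the ranking came from, so it is an upper bound compared with the split-half methodology Doug describes.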

Doug rightly points out that automatic disk caching transparently gets
you a lot of the benefit, but that doesn't scale up efficiently across
multiple query-servers.

In such a situation, each front-end query-server could specialize in
caching a deterministic part of the most common results -- using a
distributed hashtable (DHT) technique -- and get results from each
other's RAM caches via quick net rather than slow disk or
insufficiently warm local caches. See memcached as an example
implementation:

   http://www.danga.com/memcached/
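The "deterministic part" could be as simple as hashing the query to pick the server whose RAM cache owns it. A minimal Python sketch, assuming a fixed list of front-ends; the host names and the normalization step are invented for illustration, and this does not use or imitate memcached's own client API.

```python
import hashlib

def owner_for_query(query, servers):
    """Pick the cache server responsible for a query.

    Uses a stable hash (not Python's per-process salted hash()) so
    every front-end computes the same owner for the same query."""
    normalized = " ".join(query.lower().split())  # assumed normalization
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

# Hypothetical front-end query-servers.
servers = ["qs1:11211", "qs2:11211", "qs3:11211"]
print(owner_for_query("Nutch  P2P", servers) ==
      owner_for_query("nutch p2p", servers))  # True
```

Plain modulo hashing reshuffles most keys whenever the server list changes; a ring-style scheme avoids that at the cost of a little more bookkeeping.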

--
This also gives me a few "LazyWeb" ideas I'll throw out to the
Nutch community:

People always want to P2P-ify Nutch, typically on the crawling
side, but then Doug rightly points out that compared to actually
serving searches, crawling isn't that big of a deal. If queries repeat
often enough to justify caching, the P2P win for Nutch could be
a volunteer P2P caching net of results, massively lowering traffic
against the central search indices.

Here are several levels of how it might work...

Alpha: Gossipy

Volunteers access a Nutch-powered search engine via a
special client -- probably a local HTTP proxy. This proxy
grabs, say, the top 100 results for the person's queries
and remembers them for a predictable amount of time.

The Nutch server remembers who was recently handed
which result-sets, and when a repeat query comes in from
a different client, that client is redirected to the last
peer to have made, and likely to still be caching, that
query.

In this way, the Nutch server doesn't even need to keep
the 100 results in memory: just the (hash of the) query
and the address of peers likely to have the result.
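A minimal Python sketch of the bookkeeping the server would need in this scheme. Everything here (class name, TTL, peer addresses) is hypothetical, and the actual HTTP redirect and the signing of result sets are left out.

```python
import hashlib
import time

class PeerDirectory:
    """Maps a query hash to the last peer handed that result set."""

    def __init__(self, ttl_seconds=600.0):
        self.ttl = ttl_seconds      # how long peers are assumed to cache
        self._last_peer = {}        # query hash -> (peer address, timestamp)

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.encode("utf-8")).digest()

    def route(self, query, client_addr, now=None):
        """Return a peer to redirect the client to, or None if the
        server should answer itself. Either way the requesting client
        becomes the most recent holder of the result set."""
        now = time.time() if now is None else now
        key = self._key(query)
        entry = self._last_peer.get(key)
        self._last_peer[key] = (client_addr, now)
        if entry is None:
            return None
        peer, stamp = entry
        if peer == client_addr or now - stamp > self.ttl:
            return None
        return peer

directory = PeerDirectory(ttl_seconds=60)
print(directory.route("nutch", "10.0.0.1:8080", now=0))   # None
print(directory.route("nutch", "10.0.0.2:8080", now=10))  # 10.0.0.1:8080
```

The per-query state is one digest plus one address, which is the memory saving described above.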

(The result set could be cryptographically signed to
prevent tampering in the field.)

Every request does still hit the Nutch server once,
though -- which is nice for gathering stats but still
costly, which leads to...

Beta: DHT-rrific

The special volunteer clients form a DHT which lets
them know which peers are already most likely to
contain the query in progress. They are contacted
first, resulting in hits to the central search server
only if they don't already have the desired query.
The central server never sees queries that are
served out of the P2P cache, and doesn't need to
remember where certain result-sets are cached --
though it still might serve a role in helping to
bootstrap the P2P DHT and manage entries/exits.
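One way to picture how peers would "know which peers are most likely to contain the query" is a consistent-hashing ring. The Python sketch below keeps the whole ring in one process, which a real DHT would not do (routing state would be spread across peers); the class and the peer names are assumptions for illustration.

```python
import bisect
import hashlib

def _point(value):
    """Stable position on the ring for a peer id or a query."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, peers=()):
        self._points = []   # sorted ring positions
        self._owners = {}   # ring position -> peer id
        for peer in peers:
            self.join(peer)

    def join(self, peer):
        pt = _point(peer)
        if pt not in self._owners:
            bisect.insort(self._points, pt)
        self._owners[pt] = peer

    def leave(self, peer):
        pt = _point(peer)
        if self._owners.get(pt) == peer:
            del self._owners[pt]
            self._points.remove(pt)

    def lookup(self, query):
        """Peer whose position follows the query's position on the ring."""
        if not self._points:
            return None  # no volunteers: fall back to the central server
        i = bisect.bisect_right(self._points, _point(query))
        return self._owners[self._points[i % len(self._points)]]

ring = HashRing(["peer-a", "peer-b", "peer-c"])  # hypothetical peer ids
owner = ring.lookup("nutch p2p")
```

When a peer joins or leaves, only the queries adjacent to it on the ring change owner, so most cached result sets stay where clients expect to find them.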

However, in this scenario, you're leaking potentially
private info about your queries to random other
volunteers. Which leads to...

Gamma-1: Crypty

The central server actually encrypts the result sets
with a deterministic key based on the query, so even
the caches don't prima facie know what they're caching,
but the queriers still know deterministically which
caches to check and how to decrypt the results.
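A toy Python sketch of that idea: one keyed hash of the query says where to look, and a second, independent one is the encryption key, so a cache holding the blob learns neither the query nor the key. The seed parameter, the function names, and the XOR-keystream cipher are all assumptions for illustration; a real design would use a vetted authenticated cipher, not this.

```python
import hashlib
import hmac

def derive(query, seed, purpose):
    """Deterministic secret from (seed, purpose, query)."""
    msg = purpose + b"\x00" + query.encode("utf-8")
    return hmac.new(seed, msg, hashlib.sha256).digest()

def cache_locator(query, seed):
    """What the caches see: an opaque id, not the query text."""
    return derive(query, seed, b"locate").hex()

def _keystream_xor(key, data):
    # Toy stream cipher: XOR with SHAKE-256 output. Unauthenticated and
    # unsafe if one key ever encrypts two different result sets.
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt_results(query, seed, plaintext):
    return _keystream_xor(derive(query, seed, b"encrypt"), plaintext)

decrypt_results = encrypt_results  # XOR is its own inverse

seed = b"2004-05-26"  # hypothetical seed; changing it changes every locator
blob = encrypt_results("nutch p2p", seed, b"top 100 results ...")
print(decrypt_results("nutch p2p", seed, blob))  # b'top 100 results ...'
```

Anyone who can guess the query can recompute both values, which is exactly the dictionary weakness the scheme has to live with.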

A patient attacker could over time build a dictionary
of common queries, but the ability of such a dictionary
to compromise user privacy could be limited by
varying hash seeds regularly or other techniques,
including...

Gamma-2: Mixy

The same machines doing caching also offer anonymous
proxying so the cache-locations can't easily deduce the
ultimate identities of the users requesting their
result sets.


--
The client-side volunteer caching P2P extension could
probably even be a Javascript/XUL one-click-install
Mozilla extension...

- Gordon @ IA
