> AltaVista, via Raymie Stata, donated a sampling of their query logs
> from a few years back for research purposes to the Internet Archive.
> For those interested, they're available via FTP:
> 
>    ftp ftp.archive.org
>    login anonymous
>    cd pub/AVLogs
> 
> I don't know anything more about their origin and potential uses than
> is in the README.
> 
> When we first received them, I did run a quick analysis, and it looked
> like caching could help a lot.
> 
> I found that:
> 
> Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1%
> of most-common queries accounted for 2,321,141 requests (32%), and the
> top 10% of most-common queries accounted for 4,288,312 requests (60%).
> 
> Doug rightly points out that automatic disk caching transparently gets
> you a lot of the benefit, but that doesn't scale up efficiently across
> multiple query-servers.
> 
> In such a situation, each front-end query-server could specialize in
> caching a deterministic part of the most common results -- using a
> distributed hashtable (DHT) technique -- and get results from each
> other's RAM caches via quick net rather than slow disk or
> insufficiently warm local caches. See memcached as an example
> implementation:
> 
>     http://www.danga.com/memcached/
> 

That's pretty interesting. I wonder why the results are quite different from
what Doug referred to at Excite? (I don't have FTP access right now, but
could someone tell me the date those AltaVista logs are from?)

Since we'd only need to cache the first page of results (I assume?), I have
to wonder whether distributing the cache would be required at all.

How much memory would 20,000 sets of 10 result items (for example) take?
Could we just run each query through a Bloom filter, and only look up the
cache if the filter returns a hit?
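
To make that concrete, here is a rough Python sketch of what I have in mind.
Everything in it is an assumption for illustration: the class and function
names are made up, and the 500 bytes per result item is a guess, not a
measurement. With that guess, 20,000 first pages of 10 items come to about
100 MB, while a Bloom filter sized for 20,000 queries at a 1% false-positive
rate needs only roughly 24 KB.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized with the standard formulas (illustrative)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # m = -n * ln(p) / (ln 2)^2 bits; k = (m / n) * ln 2 hash functions.
        self.num_bits = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all k positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class FirstPageCache:
    """Hypothetical cache of first result pages, guarded by a Bloom filter."""

    def __init__(self, capacity=20000):
        self.filter = BloomFilter(capacity)
        self.results = {}

    def put(self, query, first_page):
        self.filter.add(query)
        self.results[query] = first_page

    def get(self, query):
        if query not in self.filter:
            return None  # definitely never cached, skip the lookup
        # A false positive from the filter still ends up here as a miss.
        return self.results.get(query)


def estimate_cache_bytes(num_queries=20000, items_per_page=10,
                         bytes_per_item=500):
    # bytes_per_item is an assumed average (title, URL, snippet).
    return num_queries * items_per_page * bytes_per_item
```

Two caveats: a plain Bloom filter cannot forget entries, so it would have to
be rebuilt whenever the cache evicts queries, and the filter only pays off
if the cache lookup itself is expensive (on disk or on another machine). If
the whole cache fits in a local in-memory hashtable, the filter adds little.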

(Although I have to admit it would probably be more fun doing the DHT
implementation...)
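
For comparison, the "deterministic part" of the distributed approach could
be as simple as hashing the query to pick which server owns its cache
entry. This is only a sketch under my own assumptions: the server names are
hypothetical, the normalization (lowercasing and collapsing whitespace) may
not suit every query syntax, and it is not how memcached itself is
implemented.

```python
import hashlib


def owner_for_query(query, servers):
    """Pick the server that deterministically owns a query's cache entry."""
    # Assumed normalization so trivially different spellings share an entry.
    normalized = " ".join(query.lower().split())
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Hypothetical front-end query-servers.
servers = ["qs1.example.org", "qs2.example.org", "qs3.example.org"]
owner = owner_for_query("nutch search engine", servers)
```

Plain modulo hashing like this remaps most queries whenever a server is
added or removed; consistent hashing is the usual way around that.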

Nick


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
