That's pretty interesting. I wonder why the results are quite different to what Doug referred to at excite? (I don't have ftp access right now, but could someone tell me the date those altavista results are from?)
2001-09-28 to 2001-10-03. It's a subset of all their queries for the times indicated, but (according to the README) constructed such that if a query appears at all, all occurrences of that query, including followup requests for deeper ranges, appear. Unless my intuition is way off, I don't think that would skew the proportional results any.
Doug spoke of "thousands" of cached queries resulting in a less than 10% hit rate. That's not necessarily in conflict with the AV sample: there, the top 1% of queries -- over *21K* -- seemed to get a 32% hit rate. And remember that's just a subsample of all AV queries in the period... while you might be able to generalize that 1% of unique queries ~= 30% of query traffic, that 1% of unique queries might in fact be 500K, 2M, or more.
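For anyone who wants to repeat this kind of measurement on another query log, here is a rough sketch in Python. The function name and the toy log are made up for illustration and are not from the AV data set. It reports what share of total traffic the most frequent fraction of unique queries accounts for, which is only an upper bound on the hit rate of a cache preloaded with those queries.

```python
from collections import Counter

def top_query_share(queries, fraction):
    """Share of total traffic covered by the most frequent `fraction`
    of unique queries (hypothetical helper, not from the AV README)."""
    counts = Counter(queries)
    # Number of unique queries to keep; always keep at least one.
    k = max(1, int(len(counts) * fraction))
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(queries)

# Toy log: 4 unique queries, 10 requests in total.
log = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
share = top_query_share(log, 0.25)  # top 25% of unique queries = just "a"
```

On a real log you would pass `fraction=0.01` to get the figure comparable to the 1% number above.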
Since we'd only need to cache the first page of results (I assume?) I have to wonder if distributing the cache would be required.
How much memory would 20000 sets of 10 result items (for example) take?
Running a quick test on a Mozdex HTML page of 10 results, shorn of its header/footer/style/js content, gives about 1.9K, gzipped. So a 3GB RAM cache could store about 1.5 million 10-item results.
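The arithmetic, spelled out as a small Python sketch. It assumes decimal units (1.9K = 1900 bytes, 3GB = 3 * 10^9 bytes) and ignores any per-entry overhead for keys and bookkeeping, so real capacity would be somewhat lower.

```python
BYTES_PER_RESULT_PAGE = 1900  # measured: ~1.9K gzipped, 10 results per page

def cache_bytes_needed(num_queries, bytes_per_page=BYTES_PER_RESULT_PAGE):
    """Memory needed to hold one cached first page per query."""
    return num_queries * bytes_per_page

def pages_that_fit(cache_bytes, bytes_per_page=BYTES_PER_RESULT_PAGE):
    """How many cached first pages fit in a cache of the given size."""
    return cache_bytes // bytes_per_page

need_20k = cache_bytes_needed(20_000)   # 38,000,000 bytes, i.e. about 38 MB
fit_3gb = pages_that_fit(3 * 10**9)     # a bit over 1.5 million pages
```

So the 20000-query case from the question comes to roughly 38 MB, which fits comfortably on a single machine.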
Could we just run each query through a Bloom filter, and only look up the cache if the filter returns a hit?
Checking the RAM cache by hash lookup is already essentially instantaneous, so I don't see a use for a Bloom filter approximation unless you also wanted a secondary disk-based cache with expensive lookups.
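To make the hash-lookup point concrete, here is a minimal single-process sketch of such a RAM cache in Python. The class name, the LRU eviction policy, and the byte budget are my assumptions for illustration; nothing like this exists in Nutch as far as this thread says. Pages are stored gzipped, keyed by query string, and a lookup is a single dictionary access.

```python
import gzip
from collections import OrderedDict

class ResultCache:
    """Hypothetical in-memory cache of gzipped first-page results (LRU)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0                   # total size of stored gzipped blobs
        self._entries = OrderedDict()   # query -> gzipped page, oldest first

    def get(self, query):
        blob = self._entries.get(query)  # plain hash lookup
        if blob is None:
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return gzip.decompress(blob).decode("utf-8")

    def put(self, query, html):
        blob = gzip.compress(html.encode("utf-8"))
        old = self._entries.pop(query, None)
        if old is not None:
            self.used -= len(old)
        self._entries[query] = blob
        self.used += len(blob)
        # Evict least recently used entries until we are within budget.
        while self.used > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used -= len(evicted)

cache = ResultCache(max_bytes=3 * 10**9)
cache.put("nutch", "<ol><li>result 1</li></ol>")
page = cache.get("nutch")       # the stored page
miss = cache.get("not cached")  # None
```

A miss costs one failed dictionary lookup, which is why a Bloom filter in front of it would buy nothing here.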
(Although I have to admit it would probably be more fun doing the DHT implementation...)
I've only given it a glance-over, but I think memcached does most of what's needed for a shared distributed cache out-of-the-box. Reuse is fun too!
- Gordon @ IA
