Re: [Nutch-dev] cache results (+Nutch-P2P)

Doug Cutting Thu, 27 May 2004 09:50:50 -0700

Gordon Mohr (Internet Archive) wrote:

> AltaVista, via Raymie Stata, donated a sampling of their query logs
> from a few years' back for research purposes to the Internet Archive.
>
> [ ...]
>
> I found that:
>
> Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1%
> of most-common queries accounted for 2,321,141 requests (32%), and the
> top 10% of most-common queries accounted for 4,288,312 requests (60%).


That doesn't sound right to me.

Excite published query logs as well, but unfortunately I cannot find them now. The ones I published were sampled by user-id cookie. Jack Xu also published more logs, but I don't know how he sampled them.

I found some articles about the Excite logs which state:

"terms that were used only once were 1/2 of unique terms"

http://www.scils.rutgers.edu/~tefko/Courses/530/Lectures-general/Excite%20longitudinal.ppt

If this is true, then it means that much more than half of queries are unique, and there's no way a cache could achieve a hit rate of anywhere near 50%.

"531,416 unique queries, 395,461 repeated queries"

http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html

Here the highest cache hit rate possible would be 42%, and that would only be achieved by caching every query that occurred more than once.
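The arithmetic behind that bound can be sketched as follows. This is only one reading of the quoted figures: it assumes the "unique" and "repeated" counts are disjoint and together make up all requests, so that every repeated request could be a cache hit and every unique one must be a miss. The function name `max_hit_rate` is made up for illustration.

```python
def max_hit_rate(unique: int, repeated: int) -> float:
    """Upper bound on cache hit rate, assuming every repeated
    request is served from cache and every unique request misses."""
    total = unique + repeated
    return repeated / total


# Counts quoted from the Excite study above.
rate = max_hit_rate(unique=531_416, repeated=395_461)
print(f"best-case hit rate: {rate:.1%}")
```

A real cache of bounded size, or one that must see a query once before it can serve it, would fall below this ceiling.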

Since, on average, queries have more than two terms, a good way to model them might be to look at word-bigram distributions. These have a much longer tail than a Zipfian distribution.
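A minimal sketch of how one might look at the bigram distribution of a query log, assuming queries are plain whitespace-separated strings. The helper names and the three sample queries are invented for illustration and do not come from any of the logs discussed here.

```python
from collections import Counter


def bigram_counts(queries):
    """Count adjacent word pairs across all queries."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        # zip pairs each word with its successor; one-word queries add nothing
        counts.update(zip(words, words[1:]))
    return counts


def singleton_fraction(counts):
    """Fraction of distinct bigrams that occur exactly once (the tail)."""
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy data, purely hypothetical.
sample = ["new york hotels", "new york weather", "cheap hotels paris"]
counts = bigram_counts(sample)
print(counts.most_common(1))
print(singleton_fraction(counts))
```

On a real log, a large singleton fraction would indicate the long tail that limits how much a result cache can help.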

Doug

