[ ...]AltaVista, via Raymie Stata, donated a sample of their query logs from a few years back to the Internet Archive for research purposes.
I found that:
Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1% of most-common queries accounted for 2,321,141 requests (32%), and the top 10% of most-common queries accounted for 4,288,312 requests (60%).
That doesn't sound right to me.
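For anyone who wants to recheck coverage figures like these against a log, here is a minimal sketch. It assumes the log has already been reduced to one query string per request; the function name and the sample data are made up for illustration and are not from the AltaVista set.

```python
from collections import Counter

def top_fraction_coverage(queries, fraction):
    """Share of all requests covered by the most common `fraction` of unique queries."""
    counts = Counter(queries)
    # Request counts per unique query, most common first.
    ranked = sorted(counts.values(), reverse=True)
    # Number of unique queries in the top slice (at least one).
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical toy log: 10 requests, 4 unique queries.
sample = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
coverage = top_fraction_coverage(sample, 0.25)

# The percentages quoted above, recomputed from the raw counts.
top_1_percent_share = 2_321_141 / 7_175_648
top_10_percent_share = 4_288_312 / 7_175_648
```

On the quoted counts the two shares come out at roughly 32% and 60%, so the percentages are at least consistent with the request totals.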
Excite published query logs as well, but unfortunately I cannot find the logs now. The ones I published were sampled by the user-id cookie. Jack Xu also published more logs, and I don't know how he sampled.
I found some articles about the Excite logs which state:
"terms that were used only once were 1/2 of unique terms"
http://www.scils.rutgers.edu/~tefko/Courses/530/Lectures-general/Excite%20longitudinal.ppt
If this is true, then it means that much more than half of queries are unique, and there's no way a cache could achieve a hit rate of anywhere near 50%.
"531,416 unique queries, 395,461 repeated queries"
http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html
Here the highest cache hit rate possible would be 42%, and that would only be achieved by caching every query that occurred more than once.
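A rough sketch of that bound follows. It assumes "unique queries" means distinct query strings and "repeated queries" means the later occurrences of a string already seen; that is one reading of the quote, not something the article confirms. The helper name is hypothetical.

```python
def max_hit_rate(queries):
    """Hit rate of an unbounded cache: every occurrence after the first is a hit."""
    seen = set()
    hits = 0
    total = 0
    for q in queries:
        total += 1
        if q in seen:
            hits += 1
        else:
            seen.add(q)  # first occurrence is always a miss
    return hits / total if total else 0.0

# Counts quoted from the Excite article, under the reading described above.
unique_queries = 531_416
repeated_queries = 395_461
bound = repeated_queries / (unique_queries + repeated_queries)

# Hypothetical toy log: 5 requests, 2 of them repeats.
toy_rate = max_hit_rate(["a", "b", "a", "a", "c"])
```

Under that reading the bound works out to a little under 43%, in line with the figure above. Any cache of finite size, or one that expires entries, would do worse.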
Since, on average, queries have more than two terms, a good way to model them might be to look at word-bigram distributions. These have a much longer tail than a Zipfian distribution.
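As a starting point for that kind of modeling, word bigrams can be counted along these lines. This is only a sketch: it assumes whitespace tokenization and lowercasing, the function names are hypothetical, and the share of bigrams seen exactly once is just one crude indicator of how long the tail is.

```python
from collections import Counter

def word_bigrams(query):
    """Adjacent word pairs in a query, using whitespace tokenization (an assumption)."""
    words = query.lower().split()
    return list(zip(words, words[1:]))

def bigram_counts(queries):
    """Frequency of each word bigram across all queries."""
    counts = Counter()
    for q in queries:
        counts.update(word_bigrams(q))
    return counts

def singleton_share(counts):
    """Fraction of distinct bigrams that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Hypothetical toy queries, not taken from any of the logs discussed.
toy = ["new york hotels", "new york pizza", "cheap flights"]
toy_counts = bigram_counts(toy)
toy_share = singleton_share(toy_counts)
```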
Doug
