Doug Cutting wrote:
Gordon Mohr (Internet Archive) wrote:

AltaVista, via Raymie Stata, donated a sampling of their query logs
from a few years back to the Internet Archive for research purposes.

[ ...]

I found that:

Of 7,175,648 requests, there were 2,143,776 unique queries. The top 1%
of most-common queries accounted for 2,321,141 requests (32%), and the
top 10% of most-common queries accounted for 4,288,312 requests (60%).


That doesn't sound right to me.

Excite published query logs as well, but unfortunately I cannot find the logs now. The ones I published were sampled by the user-id cookie. Jack Xu also published more logs, and I don't know how he sampled.

I found some articles about the Excite logs which state:

"terms that were used only once were 1/2 of  unique terms"

http://www.scils.rutgers.edu/~tefko/Courses/530/Lectures-general/Excite%20longitudinal.ppt

If this is true, then it means that much more than half of queries are unique, and there's no way a cache could achieve a hit rate of anywhere near 50%.

Er, I interpret that *exactly* the opposite way. Consider a contrived sample of 1-term queries:

sex food sex car sex food car sex sex rumplestiltskin
food food sex XJ704 car sex food sex ebenezer sex

unique term     occurrences
sex             9
food            5
car             3
rumplestiltskin 1
XJ704           1
ebenezer        1
                ===
                20

You could say of this distribution "terms that were used only once were
1/2 of unique terms". But caching just one term -- sex -- would result
in a 45% hit rate, and caching just three terms would get an 85% hit
rate.
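
The arithmetic for this contrived sample can be checked with a short Python sketch. The name top_k_hit_rate is made up for illustration, and the sketch assumes a static cache pre-loaded with the k most common terms, so the first-time miss for each cached term is ignored:

```python
from collections import Counter

def top_k_hit_rate(queries, k):
    """Fraction of requests served by a static cache of the k most common queries."""
    counts = Counter(queries)
    cached = sum(n for _, n in counts.most_common(k))
    return cached / len(queries)

# The 20 one-term queries from the contrived sample.
sample = (
    "sex food sex car sex food car sex sex rumplestiltskin "
    "food food sex XJ704 car sex food sex ebenezer sex"
).split()

counts = Counter(sample)
singletons = sum(1 for n in counts.values() if n == 1)
print(singletons, len(counts))      # 3 6  (half of the unique terms occur once)
print(top_k_hit_rate(sample, 1))    # 0.45
print(top_k_hit_rate(sample, 3))    # 0.85
```

A real cache would fall somewhat short of these numbers, since each entry has to miss once before it can hit.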

"531,416 unique queries, 395,461 repeated queries"

http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html

Here the highest cache hit rate possible would be 42%, and that would only be achieved by caching every query that occurred more than once.
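
The 42% figure appears to come from treating the two quoted counts as a partition of the whole sample. A minimal sketch of that arithmetic follows; the partition is an assumption, since the paper's definitions of "unique" and "repeated" may not support it:

```python
unique_queries = 531_416
repeated_queries = 395_461

# Assumption: the two counts are disjoint and together cover every request.
total = unique_queries + repeated_queries

# Under that reading, only repeated queries could ever be cache hits.
upper_bound = repeated_queries / total
print(f"{total} requests, hit-rate ceiling {upper_bound:.1%}")
```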

I don't see that either.

First, what they mean by "repeated queries" is *not* the same as what is
cacheable across different users: it's when the exact same user repeats
the same query. From the referenced paper:

# _Repeat queries_ are all multiple-occurrences of the same query that
# represent request for multi-page viewing (when a user request to view
# a subsequent page Excite generates the same query).

Similarly, their definition of "unique" might consider two queries
with the same terms, but issued by different users, "unique". It's
not clear.

Their 1M query set is also fairly small, fairly old (from 1997!), and
fairly vaguely described. For example, while they say their sample
was of queries "submitted during a portion of a single day", it's not
clear that they got *all* queries during that time range. Much like the
AV dataset at IA was only a subset of queries issued during the given
time ranges, this Excite data, given the researchers' focus on followup
query behavior, might have only been "all queries by a subset of users"
during the given period.

Fresh research from a real high-traffic engine is definitely needed!

- Gordon @ IA





Since, on average, queries have more than two terms, a good way to model them might be to look at word-bigram distributions. These have a much longer tail than a Zipfian distribution.
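
As a rough sketch of what tabulating word bigrams over a query log could look like in Python (bigram_counts is a hypothetical helper and the four queries are invented for illustration, not taken from any log):

```python
from collections import Counter

def bigram_counts(queries):
    """Count adjacent word pairs across a list of query strings."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        # zip pairs each word with its successor; one-word queries add nothing.
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical queries, invented for illustration only.
queries = ["cheap car rental", "car rental deals", "cheap flights", "car"]
counts = bigram_counts(queries)
singletons = sum(1 for n in counts.values() if n == 1)

print(counts.most_common(1))
print(singletons, "of", len(counts), "distinct bigrams occur once")
```

This only shows the counting step. Whether the resulting distribution really has a longer tail than a Zipfian one would have to be checked against real log data.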

Doug


-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE.
http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers


