Many administrators record only the numerical IP addresses in their logs, since that saves time when the entries are written; the numbers can always be resolved to hostnames later. Hopefully you're using a utility that resolves the addresses in batch rather than converting them one at a time.
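Analog can do the batch conversion itself and cache the results between runs. A minimal sketch of the relevant configuration commands, assuming a recent Analog (the cache file name and the hour values are just examples):

  # Look up hostnames for the numerical addresses and save the
  # results to a cache file, so later runs don't repeat the lookups.
  DNS WRITE
  DNSFILE dnscache.txt
  # How long cached successful/failed lookups stay valid, in hours.
  DNSGOODHOURS 672
  DNSBADHOURS 168

On later runs you can switch to DNS READ to use the cache without doing any new lookups.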
By default, some servers don't record the User Agent in the log file. Your administrator could start collecting User Agent information (though he or she might complain about the additional disk space it consumes). If User Agent data wasn't being collected at the start of the log file but was added part way through, you might see a warning that some lines can't be read. If that happens, the file can still be analyzed in full; a sketch follows below.
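For what it's worth, on Apache the change is usually just logging with the combined format instead of the common format, and on the Analog side you can (as I understand it) list both formats so the old lines and the new lines are both read. A rough sketch, with the log file path as a placeholder:

  # Apache side: add Referer and User-Agent to each log line.
  LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
  CustomLog /var/log/httpd/access_log combined

  # Analog side: try both formats on each line of the same file.
  LOGFORMAT COMMON
  LOGFORMAT COMBINED
  LOGFILE /var/log/httpd/access_log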
Analog ordinarily reports all of the usage information found in your log. That is helpful to many organizations, but you can also exclude hosts or files you don't want counted.
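For example, to throw out the robot traffic you mentioned, you could exclude it by hostname or address before the reports are built (the patterns below are only illustrative; check what the robots in your own log actually resolve to):

  # Don't count requests from known crawler hosts.
  HOSTEXCLUDE *.googlebot.com
  HOSTEXCLUDE 64.68.82.*
  # Requests for robots.txt are almost always robots, too.
  FILEEXCLUDE /robots.txt

Excluded requests then simply don't show up anywhere in the reports.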
Some specific pages that might be useful:
* http://www.analog.cx/docs/include.html
* http://www.analog.cx/docs/meaning.html
* http://www.analog.cx/helpers/
Hope that helps,
-- Duke
Zimoski, Tom wrote:
I'm responsible for assessing the use of our library website by pondering the web server logs. The log I get from our system administrator doesn't include a translation of the numerical addresses of what Analog calls Hosts, nor does it have anything about User Agent.
Here are a couple of lines from the log I get:
64.68.82.184 - - [01/Aug/2004:00:02:12 +0800] "GET /ref/govfedjust.html HTTP/1.0" 200 8268
64.68.82.30 - - [01/Aug/2004:00:02:12 +0800] "GET /ref/govfedmil.html HTTP/1.0" 200 6903
Out of curiosity I have been translating the numerical addresses of the
Hosts from which we get a lot of requests, and I notice that a lot of
them are from googlebot and the like. I'm thinking this doesn't really
count as the sort of "use" of our website I want to keep track of. On
the other hand, disregarding requests from these sites takes some
effort.
There's some helpful information at http://www.iplists.com/ and also at
http://www.searchengineworld.com/spiders/spider_ips.htm but I'm
discouraged about how complicated this could become.
So I'm looking for advice about what to do. I'm thinking about looking at the hosts from which I get more than x requests in a month, figuring out which of those are search engines, and throwing them out. I might use a percentage of requests rather than a specific number of requests as a threshold.
Thanks for your attention.
Tom Zimoski
Reference Dept/Fresno County Library

