Not to stray OT, but that seems like poor logic for counting spiders.  
The IP address I'm sending this mail from now is not reverse-resolvable, 
but I'm certainly not a bot.  If you're relying on people's honesty in 
naming choices (e.g. the /(spider|crawl|search|bot)/ domain matching), 
wouldn't you be better off looking at the USER_AGENT value?
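
To make that concrete, here is a rough sketch of the User-Agent check, assuming the logs are in Apache's combined format, where the agent is the last double-quoted field on the line. The helper name is_robot_by_agent, the pattern list, and both sample lines are made up for illustration (the first is modeled on the record quoted below, with the referrer shortened); a real list of agent strings would need to be a lot longer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the User-Agent field looks like a robot.
# Assumes combined log format, i.e. the agent is the final quoted field.
sub is_robot_by_agent {
    my ($line) = @_;
    # Capture the last double-quoted field; give up if there is none.
    my ($agent) = $line =~ /"([^"]*)"\s*$/
        or return 0;
    # Illustrative pattern only, not a complete list of robot agents.
    return $agent =~ /bot|crawl|spider|slurp/i ? 1 : 0;
}

# Made-up sample records for demonstration.
my @sample = (
    'dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] '
      . '"GET /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379 '
      . '"http://search.t1msn.com.mx/results.asp?q=x" '
      . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    'crawler.example.com - - [30/Jun/2002:00:04:10 -0400] '
      . '"GET /robots.txt HTTP/1.0" 200 120 "-" "ExampleBot/1.0"',
);

for my $line (@sample) {
    my $label = is_robot_by_agent($line) ? 'robot' : 'other';
    printf "%-6s %s\n", $label, substr($line, 0, 40);
}
```

Because only the last quoted field is tested, "search" in the referrer no longer matters. Of course this still relies on the robot naming itself honestly in its agent string.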

George

On Thursday, July 11, 2002, at 05:12 PM, John W. Krahn wrote:

> Kevin Zembower wrote:
>>
>> I'm trying to do a quick-n-dirty (well, I've been at work on it three
>> hours now) analysis of Apache web logs. I'm trying to count the number
>> of records from robots or spiders. For my purposes, a robot or spider is
>> a request from either an unresolved IP address, or one that has "bot",
>> "spider", "crawl" or "search" in its resolved domain name. I don't
>> count at all requests that come from my LAN (172.16.0.0/16) or domain
>> (jhuccp.org). My program so far is this:
>> #!/usr/local/bin/perl -w
>> my ($robotcount, $totalcount) = (0, 0);
>> while (<>) {
>>    next if /^172\.16/;
>>    next if /^.*?jhuccp\.org +?/;
>>    $totalcount++;
>>    if (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|.*(bot|crawl|spider|search).*?) ..*$/) {
>>       print;
>>       $robotcount++;
>>       }
>>    }
>> print "Robot count is $robotcount\tTotal count is $totalcount\t Ratio is "
>>    . $robotcount/$totalcount . "\n";
>>
>> This correctly picks up the numerical IP addresses, but also matches
>> records like this:
>> dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET
>> /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379
>> "http://search.t1msn.com.mx/results.asp?q=relaci%C3%B3n+sexual&origq=yahoo&
>> FORM=IE4&v=1&cfg=SMCSP&nosp=0&thr=&submitbutton.x=39&submitbutton.y=12"
>> "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
>>
>> Here, the word "search" is in the referrer field.
>>
>> How do I tell it to search only up to the first space character? I
>> think I can do it by defining a second variable that is just the part of
>> the record up to the first space, and matching on that. But, is there
>> another way, probably using the 'minimizing' quantifiers?
>
>
> if (/^(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\S*(?:bot|crawl|spider|search)\S*)\s/) {
>
>
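
For comparison, the "second variable" idea Kevin mentions can be sketched roughly as follows: split off the first whitespace-delimited field and match only against that, which has the same effect as anchoring with \S* as John does above. The helper name is_robot_by_host and the sample lines are hypothetical, and this is only one way to write it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a log line by its first field (the host) only.
sub is_robot_by_host {
    my ($line) = @_;
    # With a limit of 2, split stops after the first run of whitespace,
    # so $host cannot contain anything from the request or referrer.
    my ($host) = split ' ', $line, 2;
    return 0 unless defined $host;
    return 1 if $host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;   # unresolved IP address
    return 1 if $host =~ /bot|crawl|spider|search/;     # name-based guess
    return 0;
}

# Made-up sample records; the first has "search" only in the referrer.
my @sample = (
    'dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] '
      . '"GET /x.stm HTTP/1.1" 200 9379 "http://search.example.com/?q=x"',
    '10.1.2.3 - - [30/Jun/2002:00:04:10 -0400] "GET / HTTP/1.0" 200 120',
    'crawler1.example.com - - [30/Jun/2002:00:05:00 -0400] "GET / HTTP/1.0" 200 120',
);

for my $line (@sample) {
    my $label = is_robot_by_host($line) ? 'robot' : 'other';
    printf "%-6s %s\n", $label, (split ' ', $line, 2)[0];
}
```

Keeping the host in its own variable also makes it easy to print or count hosts separately later, at the cost of one extra split per line.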
> John
> --
> use Perl;
> program
> fulfillment
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
// George Schlossnagle
// Principal Consultant
// OmniTI, Inc          http://www.omniti.com
// (c) 240.460.5234   (e) [EMAIL PROTECTED]
// 1024D/1100A5A0  1370 F70A 9365 96C9 2F5E 56C2 B2B9 262F 1100 A5A0

