Re: Matching IPs or robots

Wiggins d'Anconia Thu, 11 Jul 2002 19:07:36 -0700

Possibly, but possibly not. My previous employer used a self-written 
spider that we often populated with a standard USER_AGENT, for instance 
Mozilla's or IE's, so that we would be sure to get a real representation 
of the page/script at the other end.  Other times this didn't matter as 
much to us and we just went with one we made up on our own for a laugh.


It would seem that to get the most bang for your buck you would test 
both: catch the ones that you can guess very easily from the user agent, 
i.e. the ones that are common and well known, and then run the rest 
through the name lookup.
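
Something along these lines is what I have in mind. It is only a rough sketch: it assumes the combined log format (user agent as the last quoted field), the list of agent names is one I made up for illustration, and classify() and count_robots() are hypothetical helper names, not anything from the original script.

```perl
use strict;
use warnings;

# Assumption: a hand-picked list of crawler user-agent fragments, for illustration only.
my $known_agents = qr/googlebot|slurp|crawler|spider/i;

# Classify one log line as 'skip' (local traffic), 'robot' or 'human'.
sub classify {
    my ($line) = @_;

    # The host is everything up to the first space.
    my ($host) = $line =~ /^(\S+)/ or return 'skip';

    # Assumption: combined log format, so the agent is the last quoted field.
    my ($agent) = $line =~ /"([^"]*)"\s*$/;

    # Ignore the LAN and the local domain, as in the original question.
    return 'skip' if $host =~ /^172\.16\./ || $host =~ /(?:^|\.)jhuccp\.org$/i;

    # Cheap test first: a user agent that announces itself as a robot.
    return 'robot' if defined $agent && $agent =~ $known_agents;

    # Then the name lookup: an unresolved IP or a telltale host name.
    return 'robot' if $host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;
    return 'robot' if $host =~ /bot|crawl|spider|search/i;

    return 'human';
}

# Count robots over a list of log lines; the total excludes skipped lines.
sub count_robots {
    my ($robots, $total) = (0, 0);
    for my $line (@_) {
        my $kind = classify($line);
        next if $kind eq 'skip';
        $total++;
        $robots++ if $kind eq 'robot';
    }
    return ($robots, $total);
}

# Small demo on made-up records (the example.* hosts are hypothetical).
my @sample = (
    'crawler1.example.com - - [30/Jun/2002:00:03:50 -0400] "GET / HTTP/1.0" 200 100 "-" "Mozilla/4.0"',
    'host.example.net - - [30/Jun/2002:00:03:51 -0400] "GET / HTTP/1.0" 200 100 "-" "Googlebot/2.1"',
);
my ($robots, $total) = count_robots(@sample);
print "Robot count is $robots\tTotal count is $total\n";
```

Because the host test only ever sees the first field, a "search" in the referrer no longer counts. To run it over a real log you would feed the lines read from <> to count_robots() instead of @sample.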

Anyone hiring?
http://danconia.org

George Schlossnagle wrote:
> Not to stray OT, but that seems like poor logic for counting spiders.  
> The IP address I'm sending this mail from now is not reverse-resolvable, 
> but I'm certainly not a bot.  If you're relying on people's honesty in 
> naming choices (e.g. the /(spider|crawl|search|bot)/ domain matching), 
> wouldn't you be better off looking at the USER_AGENT value?
> 
> George
> 
> On Thursday, July 11, 2002, at 05:12 PM, John W. Krahn wrote:
> 
>> Kevin Zembower wrote:
>>
>>>
>>> I'm trying to do a quick-n-dirty (well, I've been at work on it three
>>> hours now) analysis of Apache web logs. I'm trying to count the number
>>> of records from robots or spiders. For my purposes, a robot or spider is
>>> a request from either an unresolved IP address, or one that has "bot",
>>> "spider", "crawl" or "search" in its resolved domain name. I don't
>>> count at all requests that come from my LAN (172.16.0.0/16) or domain
>>> (jhuccp.org). My program so far is this:
>>> #!/usr/local/bin/perl -w
>>> my ($robotcount, $totalcount) = (0, 0);
>>> while (<>) {
>>>    next if /^172\.16/;
>>>    next if /^.*?jhuccp\.org +?/;
>>>    $totalcount++;
>>>    if
>>> (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|.*(bot|crawl|spider|search).*?)
>>> ..*$/) {
>>>       print;
>>>       $robotcount++;
>>>       }
>>>    }
>>> print "Robot count is $robotcount\tTotal count is $totalcount\t Ratio
>>> is " . $robotcount/$totalcount . "\n";
>>>
>>> This correctly picks up the numerical IP addresses, but also matches
>>> records like this:
>>> dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET
>>> /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379
>>> "http://search.t1msn.com.mx/results.asp?q=relaci%C3%B3n+sexual&origq=yahoo&FORM=IE4&v=1&cfg=SMCSP&nosp=0&thr=&submitbutton.x=39&submitbutton.y=12"
>>> "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
>>>
>>> Here, the word "search" is in the referrer field.
>>>
>>> How do I tell it to search only up to the first space character? I
>>> think I can do it by defining a second variable that is just the part of
>>> the record up to the first space, and matching on that. But, is there
>>> another way, probably using the 'minimizing' quantifiers?
>>
>>
>>
>> if (/^(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\S*(?:bot|crawl|spider|search)\S*)\s/)
>> {
>>
>>
>> John
>> -- 
>> use Perl;
>> program
>> fulfillment
>>
>> -- 
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>
> // George Schlossnagle
> // Principal Consultant
> // OmniTI, Inc          http://www.omniti.com
> // (c) 240.460.5234   (e) [EMAIL PROTECTED]
> // 1024D/1100A5A0  1370 F70A 9365 96C9 2F5E 56C2 B2B9 262F 1100 A5A0
> 
> 


