Not to stray OT, but that seems like poor logic for counting spiders. The IP address I'm sending this mail from now is not reverse-resolvable, but I'm certainly not a bot. If you're relying on people's honesty in naming choices (e.g. the /(spider|crawl|search|bot)/ domain matching), wouldn't you be better off looking at the USER_AGENT value?
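Something like the following is a rough sketch of the USER_AGENT approach, assuming the logs are in Apache's combined format, where the user agent is the last double-quoted field on each line. The sub name looks_like_robot and the keyword list are only illustrative, the second sample line is made up, and the referrer in the first one is shortened. It still depends on the client being honest about its agent string.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical helper: returns 1 if the User-Agent field (assumed to be
# the last double-quoted field, as in Apache's combined log format)
# contains a robot-like keyword, 0 otherwise.
sub looks_like_robot {
    my ($line) = @_;
    # Capture the final "..." field on the line.
    my ($agent) = $line =~ /"([^"]*)"\s*$/;
    return 0 unless defined $agent;
    return $agent =~ /bot|crawl|spider|search/i ? 1 : 0;
}

# Record from the original post (referrer shortened): "search" appears
# only in the referrer, not in the agent string.
my $human = 'dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] '
          . '"GET /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379 '
          . '"http://search.t1msn.com.mx/results.asp?q=yahoo" '
          . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"';

# Made-up record with a crawler-style agent string, for illustration only.
my $robot = '10.1.2.3 - - [30/Jun/2002:00:04:10 -0400] '
          . '"GET /index.html HTTP/1.0" 200 1234 "-" '
          . '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"';

print looks_like_robot($human), "\n";   # prints 0
print looks_like_robot($robot), "\n";   # prints 1
```

In a real run the two sample strings would be replaced by a while (<>) loop over the log, incrementing the counters as in the original program.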
George

On Thursday, July 11, 2002, at 05:12 PM, John W. Krahn wrote:

> Kevin Zembower wrote:
>>
>> I'm trying to do a quick-n-dirty (well, I've been at work on it three
>> hours now) analysis of Apache web logs. I'm trying to count the number
>> of records from robots or spiders. For my purposes, a robot or spider is
>> a request from either an unresolved IP address, or one that has "bot",
>> "spider", "crawl" or "search" in it's resolved domain name. I don't
>> count at all requests that come from my LAN (172.16.0.0/16) or domain
>> (jhuccp.org). My program so far is this:
>> #!/usr/local/bin/perl -w
>> my ($robotcount, $totalcount) = 0;
>> while (<>) {
>>    next if /^172\.16/;
>>    next if /^.*?jhuccp\.org +?/;
>>    $totalcount++;
>>    if
>> (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|.*(bot|crawl|spider|search).*?)
>> ..*$/) {
>>       print;
>>       $robotcount++;
>>    }
>> }
>> print "Robot count is $robotcount\tTotal count is $totalcount\t Ratio
>> is " . $robotcount/$totalcount . "\n";
>>
>> This correctly picks up the numerical IP addresses, but also matches
>> records like this:
>> dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET
>> /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379
>> "http://search.t1msn.com.mx/results.asp?q=relaci%C3%B3n+sexual&origq=yahoo&
>> FORM=IE4&v=1&cfg=SMCSP&nosp=0&thr=&submitbutton.x=39&submitbutton.y=12"
>> "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
>>
>> Here, the word "search" is in the referrer field.
>>
>> How do I tell it to search only up to the first space character? I
>> think I can do it by defining a second variable that is just the part of
>> the record up to the first space, and matching on that. But, is there a
>> another way, probably using the 'minimizing' quantifiers?
>
>
> if (/^(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\S*(?:bot|crawl|spider|search)\S*)\s/)
> {
>
>
> John
> --
> use Perl;
> program
> fulfillment
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

// George Schlossnagle
// Principal Consultant
// OmniTI, Inc          http://www.omniti.com
// (c) 240.460.5234   (e) [EMAIL PROTECTED]
// 1024D/1100A5A0 1370 F70A 9365 96C9 2F5E 56C2 B2B9 262F 1100 A5A0