I'm trying to do a quick-n-dirty (well, I've been at work on it three
hours now) analysis of Apache web logs. I'm trying to count the number
of records from robots or spiders. For my purposes, a robot or spider is
a request from either an unresolved IP address, or a host that has "bot",
"spider", "crawl" or "search" in its resolved domain name. I don't
count requests that come from my LAN (172.16.0.0/16) or domain
(jhuccp.org) at all. My program so far is this:
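#!/usr/local/bin/perl -w
use strict;
# Initialize both counters; "= 0" alone would set only the first one.
my ($robotcount, $totalcount) = (0, 0);
while (<>) {
   next if /^172\.16/;
   next if /^.*?jhuccp\.org +?/;
   $totalcount++;
   if (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|.*(bot|crawl|spider|search).*?) ..*$/) {
      print;
      $robotcount++;
   }
}
# Guard against dividing by zero when no records were counted.
my $ratio = $totalcount ? $robotcount / $totalcount : 0;
print "Robot count is $robotcount\tTotal count is $totalcount\t Ratio is $ratio\n";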

This correctly picks up the numerical IP addresses, but also matches
records like this:
dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET
/prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379
"http://search.t1msn.com.mx/results.asp?q=relaci%C3%B3n+sexual&origq=yahoo&FORM=IE4&v=1&cfg=SMCSP&nosp=0&thr=&submitbutton.x=39&submitbutton.y=12"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Here, the word "search" is in the referrer field.
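
A small sketch of why this happens, using a shortened, made-up record (the
host name and URL below are illustrative, not taken from the real logs). In
Perl, "." matches a space, so ".*" can run past the host field into the rest
of the record, and a minimal ".*?" still expands as far as it needs to for
the whole pattern to succeed. A pattern built from "\S*" cannot leave the
first field, because "\S" never matches a space.

```perl
#!/usr/local/bin/perl -w
use strict;

# Illustrative record: "search" appears only in the referrer, not in the host.
my $line = 'host.example.net - - [30/Jun/2002:00:03:50 -0400] '
         . '"GET /x.stm HTTP/1.1" 200 9379 "http://search.example.com/?q=1"';

# '.' also matches spaces, so this can reach into the referrer field.
my $loose = ($line =~ /^.*?(bot|crawl|spider|search)/) ? 1 : 0;

# \S never matches a space, so this stays inside the first field.
my $tight = ($line =~ /^\S*(bot|crawl|spider|search)/) ? 1 : 0;

print "loose=$loose tight=$tight\n";
```

On this sample record the loose pattern matches and the tight one does not.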

How do I tell it to search only up to the first space character? I
think I can do it by defining a second variable that is just the part of
the record up to the first space, and matching on that. But is there
another way, probably using the 'minimizing' quantifiers?
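
A minimal sketch of the second-variable idea, under some assumptions: the
helper name classify() and the sample records are made up for illustration,
the LAN test is tightened to "172.16." with a trailing dot, and the keyword
match is made case-insensitive. Each record is split once on whitespace, and
every test then looks only at the host field, so the referrer can never
contribute a match.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical helper: classify a record by its host field only.
# Returns 'skip', 'robot', or 'other'.
sub classify {
    my ($record) = @_;
    # Everything before the first whitespace is the host field.
    my ($host) = split ' ', $record, 2;
    return 'skip'  unless defined $host;                    # blank line
    return 'skip'  if $host =~ /^172\.16\./;                # LAN address
    return 'skip'  if $host =~ /(^|\.)jhuccp\.org$/i;       # own domain
    return 'robot' if $host =~ /^\d{1,3}(\.\d{1,3}){3}$/;   # unresolved IP
    return 'robot' if $host =~ /bot|crawl|spider|search/i;  # keyword in name
    return 'other';
}

# Made-up, shortened sample records (not real log data).
my @sample = (
    '192.0.2.10 - - "GET / HTTP/1.0" 200 10',
    'crawler.example.com - - "GET / HTTP/1.0" 200 10',
    'dup-200-66-146-45.prodigy.net.mx - - "GET / HTTP/1.1" 200 10 "http://search.example.com/"',
    '172.16.5.9 - - "GET / HTTP/1.0" 200 10',
    'pc1.jhuccp.org - - "GET / HTTP/1.0" 200 10',
);

my ($robotcount, $totalcount) = (0, 0);
for my $record (@sample) {
    my $kind = classify($record);
    next if $kind eq 'skip';
    $totalcount++;
    $robotcount++ if $kind eq 'robot';
}
print "Robot count is $robotcount\tTotal count is $totalcount\n";
```

To run this over real logs, the for loop over @sample would be replaced by
a while (<>) loop that passes $_ to classify().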

Thanks for your thoughts.

-Kevin Zembower



-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD  21202
410-659-6139
