I'm trying to do a quick-n-dirty analysis of Apache web logs (well, I've been at work on it for three hours now). I want to count the number of records that come from robots or spiders. For my purposes, a robot or spider is a request from either an unresolved IP address or a host that has "bot", "spider", "crawl" or "search" in its resolved domain name. Requests that come from my LAN (172.16.0.0/16) or my domain (jhuccp.org) are not counted at all.

My program so far is this:

    #!/usr/local/bin/perl -w
    my ($robotcount, $totalcount) = (0, 0);
    while (<>) {
        next if /^172\.16\./;            # skip requests from the LAN
        next if /^.*?jhuccp\.org +?/;    # skip requests from our own domain
        $totalcount++;
        if (/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|.*(bot|crawl|spider|search).*?) ..*$/) {
            print;
            $robotcount++;
        }
    }
    print "Robot count is $robotcount\tTotal count is $totalcount\t Ratio is "
        . $robotcount/$totalcount . "\n";
This correctly picks up the numerical IP addresses, but it also matches records like this:

    dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379 "http://search.t1msn.com.mx/results.asp?q=relaci%C3%B3n+sexual&origq=yahoo&FORM=IE4&v=1&cfg=SMCSP&nosp=0&thr=&submitbutton.x=39&submitbutton.y=12" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Here, the word "search" is in the referrer field. How do I tell it to search only up to the first space character? I think I can do it by defining a second variable that holds just the part of the record up to the first space, and matching on that. But is there another way, probably using the 'minimizing' quantifiers?

Thanks for your thoughts.

-Kevin Zembower

-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD 21202
410-659-6139
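
For illustration, here is a minimal sketch of the "second variable" idea mentioned in the message: pull out the first space-delimited field (the client host) and run the robot tests against that field only, so the referrer can never match. The helper name classify_line and the last three sample records are hypothetical, invented for this example; only the first sample comes from the message, with its query string cut short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): classify one log
# record by looking only at its first whitespace-delimited field.
# Returns 'skip' for local traffic, 'robot' for a suspected robot, else 'other'.
sub classify_line {
    my ($line) = @_;
    my ($host) = $line =~ /^(\S+)/;    # everything before the first space
    return 'skip' unless defined $host;
    return 'skip'  if $host =~ /^172\.16\./;                 # LAN address
    return 'skip'  if $host =~ /(?:^|\.)jhuccp\.org$/i;      # own domain
    return 'robot' if $host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;  # unresolved IP
    return 'robot' if $host =~ /bot|crawl|spider|search/i;   # name looks like a robot
    return 'other';
}

# Sample records: the first is the one from the message (query string cut
# short); the other three are invented for illustration.
my @sample = (
    'dup-200-66-146-45.prodigy.net.mx - - [30/Jun/2002:00:03:50 -0400] "GET /prs/sj41/sj41chap1_3.stm HTTP/1.1" 200 9379 "http://search.t1msn.com.mx/results.asp" "Mozilla/4.0"',
    '192.0.2.10 - - [30/Jun/2002:00:04:00 -0400] "GET / HTTP/1.0" 200 100 "-" "-"',
    'crawler.example.com - - [30/Jun/2002:00:05:00 -0400] "GET / HTTP/1.0" 200 100 "-" "-"',
    '172.16.4.2 - - [30/Jun/2002:00:06:00 -0400] "GET / HTTP/1.0" 200 100 "-" "-"',
);

my ($robotcount, $totalcount) = (0, 0);
for my $line (@sample) {
    my $kind = classify_line($line);
    next if $kind eq 'skip';           # local traffic is not counted at all
    $totalcount++;
    $robotcount++ if $kind eq 'robot';
}

# Guard against dividing by zero when every record was skipped.
my $ratio = $totalcount ? sprintf('%.3f', $robotcount / $totalcount) : 'n/a';
print "Robot count is $robotcount\tTotal count is $totalcount\tRatio is $ratio\n";
```

On the question about 'minimizing' quantifiers: a non-greedy .*? probably will not help by itself, because . also matches spaces and the regex engine still stretches the match as far as it must to succeed. Keeping everything in a single pattern should work if . is replaced by \S (or [^ ]) in front of the keywords, along the lines of /^\S*(?:bot|crawl|spider|search)/, since \S* cannot cross the first space.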