I'm trying to analyze web log records that look like this:

2004-03-28 00:38:31 d7.facsmf.utexas.edu - W3SVC1 DB db.jhuccp.org GET 
/dbtw-wpd/exec/dbtwpcgi.exe 
XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpcgi.exe&BU=http%3A%2F%2Fdb.jhuccp.org%2Fpopinform%2Fbasic.html&QB0=AND&QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&QI0=China%0D%0A&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10&TN=popline&RF=ShortRecordDisplay&DF=LongRecordDisplay&DL=1&RL=1&NP=0&AC=QBE_QUERY&x=37&y=4
 200 0 21248 814 19391 80 HTTP/1.1 
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705) - 
http://db.jhuccp.org/popinform/basic.html 

In this record, the tenth space-delimited field, which starts with "XC=%2Fdbtw", 
contains variables named "QF" followed by a number, for instance 
"QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&".
This indicates that the database fields to be searched are "Abstract 
KeywordsMajor KeywordsMinor..." The "QI" variable with the same number, in this case 
"QI0=China%0D%0A", indicates a search for "China" in these fields.

For every "QF" variable, there should be a corresponding "QI" variable with the same 
number, although its value might be blank, as in "QF1=Author+%7C+CN&QI1=&". This 
section of the above example indicates that a search should be performed in the 
"Author" and "CN" fields, but the value for "QI1" is blank, so it matches everything.
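
To make the pairing concrete, here is a minimal sketch of how I understand the 
"QFn" and "QIn" variables to line up. It is not taken from my program: it splits 
the query string on "&" and "=" instead of using regexes, and the helper name 
pair_fields, the hashes %param and %pair, and the shortened query string are all 
made up for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of listqueries3.pl): split a query string
# into name/value pairs, then line up each QFn with the QIn that has the
# same number.
sub pair_fields {
    my ($query) = @_;

    my %param;
    foreach my $piece (split /&/, $query) {
        # Limit of 2 so a value that itself contains '=' stays in one piece.
        my ($name, $value) = split /=/, $piece, 2;
        next unless defined $name && length $name;
        $param{$name} = defined $value ? $value : "";
    }

    my %pair;
    foreach my $name (keys %param) {
        next unless $name =~ /^QF(\d+)$/;
        my $n = $1;    # keep the field number in a named variable
        $pair{$n} = {
            fields => $param{$name},
            # A missing QIn is treated like a blank one (my assumption).
            value  => exists $param{"QI$n"} ? $param{"QI$n"} : "",
        };
    }
    return \%pair;
}

# Shortened, made-up query string in the same shape as the log record.
my $query = "QB0=AND&QF0=Abstract+%7C+TT&QI0=China%0D%0A"
          . "&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10";
my $pair  = pair_fields($query);
foreach my $n (sort { $a <=> $b } keys %$pair) {
    print "QF$n = $pair->{$n}{fields}\tQI$n = $pair->{$n}{value}\n";
}
```

The values are still percent-encoded at this point; decoding them would be a 
separate step.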

My program, which I've pasted in below my signature, tries to find a "QF" value, 
matches it to a list of fieldnames ("If the list of fields to be searched contains the 
'Abstract' field, it should be considered a 'subject' search") then grabs the 
corresponding "QI" value, to print it out. However, I can never match anything beyond 
the digit. In my program below, the line:

           print "Match successful!\n" if ($query =~ /QI$1/);

works, but the next three lines:

           $query =~ /QI$1=(.*?)&/;
           $subject = $1;
           print "Subject: $1\n" if ($debug);

never match anything.

I've been working on this, on and off, for the last two days. Any suggestions, or 
pointers to my boneheaded errors, are gratefully appreciated. Any other overall 
suggestions on my coding are welcome. This script seems to run very slowly, probably 
because of all the complex regexes.

Thanks for all your help and suggestions.

-Kevin Zembower

centernet:/opt/analog/logdata/db # cat listqueries3.pl 
#!/usr/local/bin/perl

$debug = 1;

while (<>) {
   next unless (/TN=popline/i); #Just analyze the records for the POPLINE database
   
   $subject = $author = $docno = $title = $journalsource = $keywords = $languages = 
$popreporttopic = $refereed = $year = "";
   
   if (/^.* .* .* .* .* .* .* GET [^ ]*dbtwpcgi\.exe .*QI0=[^&]*&.*QI1=[^&]*&.*/){
   
     if (/QI2/) { $type = "A"; } else {  $type = "B"; }
     ($date, $time, $source, $junk, $junk, $host, $FQDN, $method, $file, $query, 
$junk) = split;
     
     while ($query =~ m/QF(\d+)=(\S*?)&/ig) {
        print "fieldnumber = :$1:\tfieldname = $2\n" if ($debug);
        if ($2 =~ /abstract/i) {
           print "Abstract found!\n" if ($debug);
           print "Query: $query\n" if ($debug);
           print "Match successful!\n" if ($query =~ /QI$1/);
           $query =~ /QI$1=(.*?)&/;
           $subject = $1;
           print "Subject: $1\n" if ($debug);
        } elsif ($2 =~ /author/i) {
           $query =~ /QI$1=(\S*?)&/;
           $author = $1;
        } elsif ($2 =~ /engtitle/i) {
           $query =~ /QI$1=(\S*?)&/;
           $title = $1;
        }   
     } #while there are more matches for QFn fields

     $outstring = 
"$type\t$date\t$time\t$subject\t$author\t$title\t$journalsource\t$keywords\t$languages\t$popreporttopic\t$refereed\t$year\n";
     print translate($outstring);
   }# if it's a request for a database query
}# while there are more lines in the input file   

sub translate() {
   $_ = $_[0];
   s/%22/\"/g;
   s/%2C/,/g;
   s/%20/ /g;
   s#%2F#/#g;
   s/%3D/=/g;
   s/%3B/;/g;
   s/%26/&/g;
   s/%0D//g;
   s/%0A//g;
   s/\+/ /g;
   s/%29/)/g;
   s/%28/(/g;
   s/%27/\' /g;
   s/%2b/+/g;
   s/%7C/|/g;
   s/%3A/:/g;
   #Debbie requested that all Boolean logical words and symbols be replaced with '|'
   s/\b(and)\b/|/ig;
   s/\b(or)\b/|/ig;
   s/&/|/g;
   s[/][|]g;
   $_;
   }
centernet:/opt/analog/logdata/db # cat v
2004-03-28 00:38:31 d7.facsmf.utexas.edu - W3SVC1 DB db.jhuccp.org GET 
/dbtw-wpd/exec/dbtwpcgi.exe 
XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpcgi.exe&BU=http%3A%2F%2Fdb.jhuccp.org%2Fpopinform%2Fbasic.html&QB0=AND&QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&QI0=China%0D%0A&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10&TN=popline&RF=ShortRecordDisplay&DF=LongRecordDisplay&DL=1&RL=1&NP=0&AC=QBE_QUERY&x=37&y=4
 200 0 21248 814 19391 80 HTTP/1.1 
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705) - 
http://db.jhuccp.org/popinform/basic.html
centernet:/opt/analog/logdata/db # ./listqueries3.pl v
fieldnumber = :0:       fieldname = 
Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb
Abstract found!
Query: 
XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpcgi.exe&BU=http%3A%2F%2Fdb.jhuccp.org%2Fpopinform%2Fbasic.html&QB0=AND&QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&QI0=China%0D%0A&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10&TN=popline&RF=ShortRecordDisplay&DF=LongRecordDisplay&DL=1&RL=1&NP=0&AC=QBE_QUERY&x=37&y=4
Match successful!
Subject: 
fieldnumber = :1:       fieldname = Author+%7C+CN
B       2004-03-28      00:38:31
centernet:/opt/analog/logdata/db # 

-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD  21202
410-659-6139
