I'm searching large ASCII files for keywords. The keywords are part of section headings. These headings are in all caps on lines by themselves.
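For example (this particular heading is made up, but it's the shape I'm after):

        MASTER SERVICES AGREEMENT

Here "SERVICES" would be one of my keywords.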

The files sometimes contain HTML tags. My logic handles this well enough, but combs through the HTML very slowly. I'm dealing with tens of thousands of files, so speed counts.

I thought I'd get around this by using HTML::TokeParser to remove any HTML before I searched each file. But now the script processes EVERY file slowly, taking a few seconds for each.
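One idea I've been toying with (untested sketch): skip the parse entirely for files that contain nothing tag-like, so only the HTML-bearing minority pays the parsing cost:

# cheap pre-check: a stray '<' in plain text costs us one
# needless parse, which seems harmless
if ($wholefile =~ /</)
{
        # strip tags with HTML::TokeParser, as below
}
else
{
        $wholefile2 = $wholefile; # no tags; use the text as-is
}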

Any suggestions on how I might optimize the following code, or what I could be doing better?

-- Craig


# slurp file into variable
{ local $/; $wholefile = <IN>; }

# remove HTML tags from variable, leaving only text
my $parser = HTML::TokeParser->new(\$wholefile);
my $wholefile2 = '';
while (my $token = $parser->get_token)
{
        next unless $token->[0] eq 'T'; # keep only text tokens
        $wholefile2 .= $token->[1];
}
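If a full parse turns out to be overkill, I suppose a crude tag-stripping substitution would be far faster, at the price of mangling comments, <script> bodies, and stray angle brackets (untested sketch):

(my $wholefile2 = $wholefile) =~ s/<[^>]*>//gs; # quick and dirty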

foreach my $keyword (@all_keywords)
{
        my $re = qr
        {
         ( # start of $1 variable
          (?: # start of a group
           [A-Z]+ # an all-caps word
           \s+ # followed by whitespace
          )* # zero or more such words
          \Q$keyword\E # the keyword, with metacharacters escaped
          \s+ # whitespace
          AGREEMENT # the word "AGREEMENT"
         ) # end of $1 variable
        }x;

        # the /m has to live on the qr itself: flags on the match
        # operator do not reach inside an interpolated compiled pattern
        my $wholeRE = qr{^\s*$re\s*$}m;

        # no /g needed for a single yes/no test, and leaving it off
        # avoids pos() carrying over between matches
        if ($wholefile2 =~ $wholeRE)
        {
                # proceed
        }

}
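Another thing I've wondered about (untested sketch): building one alternation out of all the keywords and scanning each file once, instead of once per keyword. With tens of thousands of files, saving a per-keyword pass over every file seems like it ought to add up:

my $alt = join '|', map { quotemeta } @all_keywords;
my $bigRE = qr{^\s*((?:[A-Z]+\s+)*(?:$alt)\s+AGREEMENT)\s*$}m;
while ($wholefile2 =~ /$bigRE/g)
{
        my $heading = $1; # the full matched heading line
        # proceed
}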

