The files sometimes contain HTML tags. My matching logic handles them well enough, but it combs through the HTML very slowly. I'm dealing with tens of thousands of files, so speed counts.
I thought I'd get around this by using HTML::TokeParser to strip any HTML before searching each file. But now the script is slow on EVERY file, plain text or not, taking a few seconds apiece.
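One idea I haven't tried yet: skip the parser entirely for files that don't appear to contain any markup, so I only pay for tokenizing when there's actually something to strip. Roughly (untested, and the /</ check is just a crude heuristic):

# only run the tag-stripping pass when the file looks like it has markup
if ($wholefile =~ /</) {
    my $parser = HTML::TokeParser->new(\$wholefile);
    while (my $token = $parser->get_token) {
        $wholefile2 .= $token->[1] if $token->[0] eq 'T';
    }
} else {
    $wholefile2 = $wholefile;   # plain text: nothing to strip
}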
Any suggestions on how I might optimize the following code, or what I could be doing better?
# slurp file into variable
{
    local $/;
    $wholefile = <IN>;
}
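For context, the open isn't shown above; it's essentially the standard slurp idiom (the filehandle and $file names here are placeholders for what my loop over files actually uses):

open my $in, '<', $file or die "can't open $file: $!";
my $wholefile = do { local $/; <$in> };   # read the whole file at once
close $in;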
use HTML::TokeParser;   # (at the top of the script)

# remove HTML tags from variable, leaving only text
my $wholefile2 = '';
my $parser = HTML::TokeParser->new(\$wholefile);
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';   # keep only text tokens
    $wholefile2 .= $token->[1];
}
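One alternative I've been eyeing but haven't benchmarked: HTML::TokeParser is a wrapper around HTML::Parser, and HTML::Parser's event-driven interface can collect the text directly via a handler, skipping the token arrayrefs entirely. My reading of the HTML::Parser docs suggests something like:

use HTML::Parser;

my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],   # append decoded text
);
$p->parse($wholefile);
$p->eof;                  # flush any buffered text
$wholefile2 = $text;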
foreach my $keyword (@all_keywords) {
    my $re = qr{
        (                   # start of $1 variable
            (               # start of a group
                (\w+[A-Z])+ # one or more words in caps
                \s+         # one or more spaces
            )*              # zero or more groups
            $keyword        # the $keyword variable
            \s+             # one or more spaces
            AGREEMENT       # the word "AGREEMENT"
        )                   # end of $1 variable
    }x;
    # /m has to be compiled into the qr itself; a /m on the match
    # below would not reach inside the interpolated pattern
    my $wholeRE = qr{^\s*$re\s*$}m;
    if ($wholefile2 =~ /$wholeRE/g) {
        # proceed
    }
}
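Also, since @all_keywords is the same for every file, the patterns could be compiled once up front instead of being rebuilt tens of thousands of times inside the per-file loop. A sketch of what I mean (the \Q...\E is my addition, to be safe if a keyword ever contains regex metacharacters):

# once, before looping over the files:
my @keyword_res = map {
    qr{^\s* ( ((\w+[A-Z])+\s+)* \Q$_\E \s+ AGREEMENT ) \s*$}xm
} @all_keywords;

# then, for each file:
foreach my $kw_re (@keyword_res) {
    if ($wholefile2 =~ $kw_re) {
        # proceed
    }
}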
-- Craig