On Wed, 9 Feb 2005, Dan Armstrong wrote:

> I'm using a regular expression to extract text from an html file.

Why?

Regular expressions are really bad at analyzing complex, frequently malformed data like HTML.

Your request is an example of that: you're matching on a very specific <font> tag, but what if the tag is different? Legit HTML can have the attributes in a different order, in different case, and with different quoting, so tags like these are all functionally identical:

    <FONT SIZE=2 COLOR="#0000FF">
    <FONT COLOR="#0000FF" SIZE=2>
    <font size="2" color="#0000FF">
    <font size="2" color="#00F">

These would all need separate expressions, or one over-complex expression to capture them all at once. It's painful, and there's a vast number of such quirks to account for. Why bother fighting it this way?

You're *much* better off if you attack the problem with a proper parser, such as HTML::Parser, HTML::SimpleParse, or HTML::TokeParser::Simple:

    <http://cpan.uwinnipeg.ca/dist/HTML-Parser>
    <http://cpan.uwinnipeg.ca/dist/HTML-SimpleParse>
    <http://cpan.uwinnipeg.ca/dist/HTML-TokeParser-Simple>

Each of these has a small learning curve, but once you get going, analyzing data like HTML gets *much* easier to do.

The path you're on now really isn't worth bothering with. Use a parser.

-- 
Chris Devers
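
P.S. To make the parser suggestion a bit more concrete, here is a minimal sketch using HTML::Parser (version 3 API). It is only an illustration, not a drop-in answer to the original question: the helper names font_texts and normalize_color are made up for this example, the target color "#0000FF" is simply the one from the tags above, and it assumes HTML::Parser is installed from CPAN. The parser hands back lowercased tag and attribute names, so attribute order, case, and quoting stop mattering.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: lowercase a color value and expand short hex
# forms such as "#00F" to "#0000ff" so equivalent values compare equal.
sub normalize_color {
    my ($color) = @_;
    $color = defined $color ? lc $color : '';
    $color =~ s/^#([0-9a-f])([0-9a-f])([0-9a-f])$/#$1$1$2$2$3$3/;
    return $color;
}

# Hypothetical helper: return the text inside every <font> element
# whose color attribute matches $wanted.
sub font_texts {
    my ($html, $wanted) = @_;
    $wanted = normalize_color($wanted);
    my @texts;
    my $depth = 0;    # greater than 0 while inside a matching <font>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: tag name and attribute names arrive lowercased.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'font';
            if ($depth) { $depth++; return }    # nested <font>
            if (normalize_color($attr->{color}) eq $wanted) {
                $depth = 1;
                push @texts, '';
            }
        }, 'tagname, attr' ],
        # End tags: leave one level of <font> nesting.
        end_h => [ sub {
            my ($tag) = @_;
            $depth-- if $tag eq 'font' && $depth;
        }, 'tagname' ],
        # Text: collect it only while inside a matching <font>.
        text_h => [ sub {
            my ($text) = @_;
            $texts[-1] .= $text if $depth;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @texts;
}

# The four variants from above, each wrapped around a sample word.
my @variants = (
    '<FONT SIZE=2 COLOR="#0000FF">one</FONT>',
    '<FONT COLOR="#0000FF" SIZE=2>two</FONT>',
    '<font size="2" color="#0000FF">three</font>',
    '<font size="2" color="#00F">four</font>',
);
print join(',', map { font_texts($_, '#0000FF') } @variants), "\n";
```

All four variants go through the same code path, with no per-variant pattern to maintain. HTML::TokeParser::Simple would let you write much the same thing as a loop over tokens instead of callbacks, if you find that style easier to read.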