Re: Clipping ALL Occurrences of a Regex in an HTML File?

Chris Devers Wed, 09 Feb 2005 12:03:54 -0800

On Wed, 9 Feb 2005, Dan Armstrong wrote:

> I'm using a regular expression to extract text from an html file.


Why?

Regular expressions are really bad at analyzing complex, frequently 
malformed data like HTML. Your request is an example of that: you're 
matching on a very specific <font> tag, but what if the tag is 
different? Legitimate HTML can have the tag attributes in a different 
order, case, or quoting style, so that tags like these are all 
functionally identical:

    <FONT SIZE=2 COLOR="#0000FF">
    <FONT COLOR="#0000FF" SIZE=2>
    <font size="2" color="#0000FF">
    <font size="2" color="#00F">

These would all need separate expressions, or one over-complex 
expression to capture them all at once. It's painful, and there are a 
vast number of such quirks to account for. 
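
To make the brittleness concrete, here is a minimal sketch. It is written 
in Python only so that it runs with nothing but the standard library; the 
same thing happens with a Perl regex. The pattern RIGID is a hypothetical 
stand-in for "a very specific <font> tag" and is not the expression from 
the original question.

```python
import re

# The four functionally identical tags listed above.
TAGS = [
    '<FONT SIZE=2 COLOR="#0000FF">',
    '<FONT COLOR="#0000FF" SIZE=2>',
    '<font size="2" color="#0000FF">',
    '<font size="2" color="#00F">',
]

# Hypothetical pattern, written against the first spelling only.
RIGID = re.compile(r'<FONT SIZE=2 COLOR="#0000FF">')

# Record which of the equivalent tags the rigid pattern recognizes.
matches = [bool(RIGID.search(tag)) for tag in TAGS]
print(matches)
```

Only the first spelling matches; the other three slip through even though 
a browser treats all four the same way.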

Why bother fighting it this way?

You're *much* better off if you attack the problem with a proper parser, 
such as HTML::Parser, HTML::SimpleParse, or HTML::TokeParser::Simple:

    <http://cpan.uwinnipeg.ca/dist/HTML-Parser>
    <http://cpan.uwinnipeg.ca/dist/HTML-SimpleParse>
    <http://cpan.uwinnipeg.ca/dist/HTML-TokeParser-Simple>

Each of these may have some small learning curve, but once you get going 
with it, analyzing data like HTML gets *much* easier to do.
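
As a rough illustration of the parser approach, the sketch below uses 
Python's standard-library html.parser rather than the Perl modules named 
above, purely so it is self-contained; HTML::Parser follows a similar 
event-driven style. The class name FontTextExtractor, the BLUE set, and 
the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: both spellings are meant to be the same shade of blue.
BLUE = {"#0000ff", "#00f"}


class FontTextExtractor(HTMLParser):
    """Collect the text inside every <font size=2 color=blue> tag."""

    def __init__(self):
        super().__init__()
        self.open_fonts = []  # per open <font>: index into found, or None
        self.found = []       # text collected from each matching tag

    def handle_starttag(self, tag, attrs):
        # The parser has already lowercased tag and attribute names and
        # stripped any quotes, so order, case, and quoting no longer matter.
        if tag != "font":
            return
        a = dict(attrs)
        if a.get("size") == "2" and (a.get("color") or "").lower() in BLUE:
            self.found.append("")
            self.open_fonts.append(len(self.found) - 1)
        else:
            self.open_fonts.append(None)

    def handle_endtag(self, tag):
        if tag == "font" and self.open_fonts:
            self.open_fonts.pop()

    def handle_data(self, data):
        # Keep text only while the innermost open <font> is a match.
        if self.open_fonts and self.open_fonts[-1] is not None:
            self.found[self.open_fonts[-1]] += data


html = (
    '<p><FONT SIZE=2 COLOR="#0000FF">one</FONT> skip '
    '<FONT COLOR="#0000FF" SIZE=2>two</FONT>'
    '<font size="2" color="#0000FF">three</font>'
    '<font size="2" color="#00F">four</font>'
    '<font size="3" color="#00F">other</font></p>'
)
parser = FontTextExtractor()
parser.feed(html)
parser.close()
print(parser.found)
```

All four equivalent spellings are picked up by the same few lines, and the 
size="3" tag is skipped, with no per-variant pattern to maintain.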

The path you're on now really isn't worth bothering with. Use a parser.



-- 
Chris Devers
