Re: [Perl-unix-users] regexp question

stormpunk Sun, 09 Jun 2002 10:50:36 -0700

Roman @ Melihhov wrote:

>How do I safely strip out HTML tags?
>
>s!<(.|\n)*>!!gi;
>
>The above construct sometimes removes actual text portions of the document if a 
>line break is reached within a tag. Ideas appreciated.
>
>  
>
Good thing this isn't a FAQ or somebody would quote from that.
`perldoc -q html`


perlfaq9: Networking: How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser 
from CPAN. Another mostly correct way is to use HTML::FormatText which 
not only removes HTML but also attempts to do a little simple formatting 
of the resulting plain text.
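
For what it's worth, here is a minimal sketch of the HTML::Parser route. It assumes HTML::Parser is installed from CPAN (it is not a core module), and html_to_text is just a name made up for this example. It only collects text events; it makes no attempt to skip the contents of script or style elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # from CPAN, not part of core Perl

# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;       # flush any text the parser is still holding
    return $text;
}

print html_to_text(qq{<p title="a > b">1 &lt; 2</p>\n});
```

Because the parser tokenizes the markup itself, the quoted ">" inside the attribute and the &lt; entity are both handled without any regex trickery.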

Many folks attempt a simple-minded regular expression approach, like 
s/<.*?>//g, but that fails in many cases because the tags may continue 
over line breaks, they may contain quoted angle-brackets, or HTML 
comments may be present. Plus, folks forget to convert entities--like 
&lt; for example.
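
To make those failure modes concrete, here is a small sketch. The sample strings and the sub names (strip_greedy, strip_naive) are made up for illustration. The first pattern is the one from the original question, the second is the naive one the FAQ mentions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the original question: the greedy * runs from the
# first "<" to the LAST ">" in the string, eating the text in between.
sub strip_greedy {
    my ($html) = @_;
    $html =~ s!<(.|\n)*>!!gi;
    return $html;
}

# Naive non-greedy pattern from the FAQ. Without /s, "." does not
# match a newline, and a quoted ">" ends the match too early.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# Hypothetical sample inputs.
my $two_tags  = "<b>bold</b> keep this <i>it</i> end";
my $multiline = "<a\nhref=\"x.html\">link</a> text";
my $quoted    = q{<img alt="a > b" src="x.png">picture};

print "greedy:          [", strip_greedy($two_tags), "]\n";   # [ end]
print "naive/multiline: [", strip_naive($multiline), "]\n";   # opening tag survives
print "naive/quoted:    [", strip_naive($quoted),    "]\n";   # [ b" src="x.png">picture]
```

So the text loss in the original question comes from the greedy quantifier rather than from the line break as such: "keep this" disappears because it sits between the first "<" and the last ">".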

Here's one ``simple-minded'' approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program 
in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .
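
If you would rather call the FAQ's substitution on a string than run it as a -p0777 filter, something like the sketch below should do. The sub names and the sample strings are made up for illustration, and the entity table only covers four common named entities (HTML::Entities from CPAN handles the full set).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same substitution as the FAQ one-liner, wrapped in a sub.
# The negated class [^>'"] matches newlines, so tags may span lines,
# and the (['"]).*?\1 branch skips over quoted attribute values.
sub strip_tags_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Deliberately tiny entity table, an assumption for this example only.
my %entity = (lt => '<', gt => '>', amp => '&', quot => '"');

sub decode_basic_entities {
    my ($text) = @_;
    # Single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s/&(lt|gt|amp|quot);/$entity{$1}/g;
    return $text;
}

my $html = qq{<a\nhref="x.html">link</a> text, <img alt="a > b" src="x.png">1 &lt; 2};
print decode_basic_entities(strip_tags_faq($html)), "\n";   # link text, 1 < 2
```

Tags have to be stripped before entities are decoded, otherwise a decoded &lt; would look like the start of a tag. Note that a comment containing ">" still trips this pattern up, which is part of why the FAQ calls it simple-minded.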
......
There is more to it; you may want to check your local copy of the FAQ.


