Roman @ Melihhov wrote:
>How do I safely strip out html tags.
>
>s!<(.|\n)*>!!gi;
>
>above construct sometimes removes actual text portions of the document if the line
>break within tag reached. Ideas appreciated.

Good thing this isn't a FAQ or somebody would quote from that.

`perldoc -q html`
perlfaq9: Networking: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.

Here's one ``simple-minded'' approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

......

There is more to it, you may check out your local FAQ
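
To see the difference the FAQ is describing, here is a small sketch that runs the naive pattern and the FAQ's pattern over the same string. The helper names strip_naive and strip_faq and the sample HTML are made up for illustration; they are not part of the FAQ. The sample has a tag broken across a line and a quoted ">" inside an attribute, the two cases the FAQ calls out. Neither version handles comments or entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the naive pattern from the FAQ text.
# Without /s, "." does not match a newline, so a tag that is
# split across two lines is never matched and stays in the output.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# Hypothetical helper wrapping the FAQ's "simple-minded" pattern.
# Inside a tag it consumes either a run of characters that are neither
# ">" nor a quote (newlines included), or a complete quoted string,
# so a ">" inside quotes does not end the tag early.
sub strip_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Made-up sample: line break inside <a ...> and a quoted ">" in title.
my $html = qq{<p>See <a\nhref="x.html" title="a > b">this</a> page.</p>\n};

print "naive: ", strip_naive($html);
print "faq:   ", strip_faq($html);
```

On this sample the naive version leaves the whole split <a ...> tag in the output, while the FAQ pattern leaves just "See this page." For anything beyond quick cleanup, HTML::Parser is still the safer route, as the FAQ says.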