At 10:00 PM 2001-10-29 -0600, ADJE WebMail Technical Support Team wrote:
>Question: How do I extract the plain text from an HTML file, or, put
>another way, how do I remove the html markups, just leaving the plain
>text?[...]
Since we're all contributing every possible way to do it, here's my favorite:
use strict;
use HTML::TokeParser;
my $p = HTML::TokeParser->new('home.html') or die "Can't open home.html: $!";
$p->{textify} = {}; # if you don't want IMG/APPLET alts
print $p->get_text('BOING');
# That gets all text up to end of document, as there is
# no "<BOING>"; and even if there were, it'd be smashed
# to lowercase and treated as "boing" anyway.
I'm currently fond of saying that if you learn one tool for dealing with
HTML, it should be HTML::Tree, and if you learn two, it should be
HTML::Tree and HTML::TokeParser. HTML::Parser is used by those two, but I
discourage people from using it themselves. It leads to poor reinvention
of wheels. (Sometimes you do need to use HTML::Parser, but then you'll go
do it regardless of my discouraging you.)
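For comparison, here is a rough sketch of the same job done with HTML::Tree, using HTML::TreeBuilder's new_from_content and as_text. The sample markup and the variable names are made up for illustration; point it at your own file or string as needed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, inlined so the sketch is self-contained.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><h1>Hello</h1><p>Plain <b>text</b> only.</p></body></html>';

# Parse the string into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# as_text concatenates the text nodes under the element it is called on.
my $text = $tree->as_text;

# Trees hold circular references, so release them explicitly.
$tree->delete;

print "$text\n";
```

Note that as_text just joins the text nodes, so text from adjacent block elements runs together with no added whitespace. If that matters, walk the tree and insert your own separators.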
--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/