At 10:00 PM 2001-10-29 -0600, ADJE WebMail Technical Support Team wrote:
>Question: How do I extract the plain text from an HTML file, or, put
>another way, how do I remove the html markups, just leaving the plain
>text?[...]

Since we're all contributing every possible way to do it, here's my favorite:

  use strict;
  use HTML::TokeParser;
  my $p = HTML::TokeParser->new('home.html') || die "Can't open home.html: $!";
  $p->{textify} = {};  # if you don't want IMG/APPLET alts
  print $p->get_text('BOING');
   # That gets all text up to end of document, as there is
   # no "<BOING>"; and even if there were, it'd be smashed
   # to lowercase and treated as "boing" anyway.

I'm currently fond of saying that if you learn one tool for dealing with
HTML, it should be HTML::Tree, and if you learn two, it should be
HTML::Tree and HTML::TokeParser.  HTML::Parser is used by those two, but I
discourage people from using it themselves.  It leads to poor reinvention
of wheels.  (Sometimes you do need to use HTML::Parser, but then you'll go
do it regardless of my discouraging you.)
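
For comparison, here is a minimal sketch of the same job done with HTML::Tree, by way of its HTML::TreeBuilder class.  The sample markup in $html is made up for illustration and stands in for the contents of home.html; to read a real file you would presumably use new_from_file instead of new_from_content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # ships in the HTML::Tree distribution

# Hypothetical sample markup, standing in for the contents of home.html.
my $html = '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>';

# Parse the string into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# as_text collects the text nodes under the root and drops the markup.
my $text = $tree->as_text;

# Free the tree explicitly once we're done with it.
$tree->delete;

print "$text\n";
```

This builds the whole document in memory, which is more than plain text extraction needs, but it pays off as soon as you want the text of one particular element rather than the whole page.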


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
