On Thu, 13 Sep 2001 [EMAIL PROTECTED] wrote:
> No- not *that* kind ...
*drat*
> I'd like to turn HTML files into text files. Right now I use a simple
> regex [s/<[^>]+>//gs], but I'm sure there are many things that will fall
> through the cracks [including character entities].
>
> I feel sure I am not the first one to have this problem. Does anyone
> know of a module or some other resource for doing this? A CPAN review
> was fruitless [but maybe I just missed it].
Indeed. Check out the O'Reilly book _Web Client Programming with Perl_,
available for free online: <http://www.oreilly.com/openbook/webclient/>
Chapter five [1] has code that does exactly what you want:
    To parse the HTML, you can use the HTML module:

        #!/bin/perl
        use LWP::Simple;
        use HTML::Parse;
        print parse_html(get ($ARGV[0]))->format;

    You can save this version of the program under the name showurl,
    make it executable, and see what happens:

        % showurl http://www.ora.com/
        O'Reilly & Associates
        About O'Reilly -- Feedback -- Writing for O'Reilly
        What's New -- Here's a sampling of our most recent postings...
        * This Week in Web Review: Tracking Ads
          Are you running your Web site like a business? These tools can help.
        * Traveling with your dog? Enter the latest Travelers' Tales
          writing contest and send us a tale.
        New and Upcoming Releases
        ...

Etc. The book, and that chapter in particular, has plenty of prose and
examples on this sort of thing. General rule: regular expressions are too
crude for any moderately complex parsing of something like an HTML
document, since you can have things like

    <img src="arrow.gif" alt="-->">

that are legitimate HTML but will break most regexes you can come up
with. Your pattern, for instance, stops at the first '>' inside the alt
attribute and leaves a stray '">' in the output. On the other hand, if
you generate a parse tree, this kind of markup is trivial to sort out,
and there are libraries to do the work for you.
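
To make the difference concrete, here is a rough sketch (mine, not the
book's) that runs the regex from the question and a parse-tree version
over the same input. The sample string and the helper names
strip_with_regex and strip_with_parser are made up for illustration, and
it assumes HTML::TreeBuilder and HTML::FormatText are installed from
CPAN. The HTML::Parse documentation describes that module as kept for
backwards compatibility and points to HTML::TreeBuilder, so the sketch
uses that instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # assumed installed (HTML-Tree distribution)
use HTML::FormatText;    # assumed installed (HTML-Format distribution)

# Hypothetical sample: a '>' inside a quoted attribute, plus an entity.
my $html = '<p>Next <img src="arrow.gif" alt="-->"> page &amp; more</p>';

# Naive approach: the substitution from the original question.
sub strip_with_regex {
    my ($text) = @_;
    $text =~ s/<[^>]+>//gs;
    return $text;
}

# Parse-tree approach: build a tree, then let a formatter render it.
sub strip_with_parser {
    my ($text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($text);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $out = $formatter->format($tree);
    $tree->delete;   # release the tree explicitly when done
    return $out;
}

print "regex:  ", strip_with_regex($html), "\n";
print "parser: ", strip_with_parser($html), "\n";
```

The regex line comes out as `Next "> page &amp; more`: the tag was cut at
the wrong '>' and the entity is left undecoded. The parser version should
give you clean text with the entity decoded, though the exact whitespace
depends on the formatter.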
[1] http://www.oreilly.com/openbook/webclient/ch05.html
--
Chris Devers [EMAIL PROTECTED]