Re: [Boston.pm] stripping

Chris Devers Thu, 13 Sep 2001 15:48:17 -0700
On Thu, 13 Sep 2001 [EMAIL PROTECTED] wrote:

> No- not *that* kind ...

*drat*
 
> I'd like to turn HTML files into text files. Right now I use a simple
> regex [s/<[^>]+>//gs], but I'm sure there are many things that will fall
> through the cracks [including character entities].
> 
> I feel sure I am not the first one to have this problem. Does anyone
> know of a module or some other resource for doing this? A CPAN review
> was fruitless [but maybe I just missed it].

Indeed. Check out the O'Reilly book _Web Client Programming with Perl_,
available for free online: <http://www.oreilly.com/openbook/webclient/>

Chapter five [1] has code that does exactly what you want:

    To parse the HTML, you can use the HTML module:

        #!/bin/perl
 
        use LWP::Simple;
        use HTML::Parse;
 
        print parse_html(get ($ARGV[0]))->format;

    You can save this version of the program under the name showurl,
    make it executable, and see what happens:

        % showurl http://www.ora.com/
        O'Reilly & Associates
 
        About O'Reilly -- Feedback -- Writing for O'Reilly
 
        What's New -- Here's a sampling of our most recent postings...
 
         * This Week in Web Review: Tracking Ads
           Are you running your Web site like a business? These tools can help.
 
         * Traveling with your dog? Enter the latest Travelers' Tales 
           writing contest and send us a tale.
 
 
        New and Upcoming Releases
        ...

Etc. The book, and particularly that chapter, has plenty of explanation
and examples of doing this sort of thing. General rule: regular
expressions are too crude for any moderately complex parsing of something
like an HTML document, since you can have things like

    <img src="arrow.gif" alt="-->">

that'll be legit HTML but will break most regexes you can come up with:
the ">" inside the alt attribute looks like the end of the tag, so the
match stops early. On the other hand, if you generate a parse tree then
you can sort out this kind of markup trivially, and there are libraries to
do the work for you.
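
To make that concrete, here is a rough sketch comparing the regex from the
question with a parse-tree approach. It assumes HTML::TreeBuilder (from the
HTML-Tree distribution on CPAN) is installed; the sample markup and the two
sub names are made up for illustration, and this is not the code from the
book.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # assumption: installed from CPAN (HTML-Tree)

# Hypothetical sample input, built around the <img> example above.
my $html = '<p>Next <img src="arrow.gif" alt="-->"> page &amp; more</p>';

# The regex from the original question: delete anything that looks like a tag.
sub strip_with_regex {
    my ($text) = @_;
    $text =~ s/<[^>]+>//gs;
    return $text;
}

# Parse-tree approach: build a tree, then ask it for its text content.
sub strip_with_tree {
    my ($text) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($text);
    my $plain = $tree->as_text;    # entities are decoded during parsing
    $tree->delete;                 # release the tree explicitly
    return $plain;
}

# The regex stops at the ">" inside alt="-->", so a stray "> is left
# behind and &amp; is never decoded.
print "regex: ", strip_with_regex($html), "\n";

# The parser understands quoted attribute values, so only the text remains.
print "tree:  ", strip_with_tree($html), "\n";
```

The regex version prints `Next "> page &amp; more`, which shows both
problems from the question at once. If you want wrapped, laid-out text
rather than one flat string, HTML::FormatText can format the same tree.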




[1] http://www.oreilly.com/openbook/webclient/ch05.html
 

-- 
Chris Devers                     [EMAIL PROTECTED]