Hi,   

I'm using the following code to try to delete <p> elements and their 
contents from HTML pages. My ultimate goal is to strip certain <p> 
elements while retaining others. However, the code below doesn't 
strip all of the <p> nodes. I'm wondering whether my understanding 
of HTML::Element is flawed (quite possible), my usage is bad (also 
possible), or this is some kind of bug. 

If I call $tree->traverse(\&callbacktest); repeatedly, I eventually 
strip out all the <p> nodes, but relying on that would make my end goal 
(identifying and stripping only certain nodes) difficult.

Here is the basic, flawed code:

use strict;

use HTML::TreeBuilder;    # requires modules from HTML::Parse
use HTML::Element;        # available from http://search.cpan.org/search?module=HTML::Parse

my $p_counter = 0;        # count the paragraphs - easiest way to identify the ones we don't want?

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->parse_file($file_name);

    print "Hey, here's a dump of the parse tree of $file_name before we play with it:\n";
    $tree->dump;                          # a method we inherit from HTML::Element

    $tree->traverse(\&callbacktest);

    print "Hey, here's a dump of the parse tree of $file_name after we play with it:\n";
    $tree->dump;
    print "And here it is, bizarrely rerendered as HTML:\n",
        $tree->as_HTML, "\n";

    $tree = $tree->delete;                # Now that we're done with it, we must destroy it.
}

sub callbacktest {
    my ($node, $start, $depth) = @_;      # get the values passed to the callback function
    if (ref $node) {                      # does $node reference part of the tree?
        my $currenttag = $node->tag;      # if it does, get the tag
        if ($currenttag eq "p") {         # if the tag is <p>, delete it and everything inside it.
            $node->delete;
        }
    }
    return HTML::Element::OK;
}


Here is sample HTML:

<html>
        <head>
        </head>
        <body>
                <center>
                        <p>First</p>
                </center>
                <center>
                        <p>Second</p>
                </center>
                <p>Third</p>
                <p>Fourth</p>
                <p>Fifth</p>
                <p>Sixth</p>
                <center>
                        <p>Seventh</p>
                </center>
        </body>
</html>

After running this code on this sample, I still have the Fifth and 
Seventh <p>s in there.
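
One possible explanation is that deleting nodes while traverse is still 
walking the tree changes the content lists it is iterating over, so some 
siblings get skipped. A workaround that avoids the question entirely is 
to collect the <p> elements first and delete them in a second pass. The 
following is only a sketch under assumptions: it relies on the look_down 
method of HTML::Element (present in current HTML-Tree releases, possibly 
not in older ones), it uses a shortened inline copy of the sample page, 
and the %unwanted table of paragraph positions is a hypothetical 
stand-in for whatever rule picks the paragraphs to strip.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened version of the sample page, inlined so the sketch is self-contained.
my $html = <<'END_HTML';
<html><head></head><body>
<center><p>First</p></center>
<p>Second</p>
<p>Third</p>
<p>Fourth</p>
<center><p>Fifth</p></center>
</body></html>
END_HTML

# Hypothetical selection rule: 1-based positions of the <p>s to strip.
my %unwanted = map { $_ => 1 } (2, 3, 5);

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Pass 1: collect every <p> element without modifying the tree.
my @paragraphs = $tree->look_down(_tag => 'p');

# Pass 2: delete the chosen ones; no traversal is in progress at this point.
my $p_counter = 0;
for my $p (@paragraphs) {
    $p_counter++;
    $p->delete if $unwanted{$p_counter};
}

# Report the text of the paragraphs that survived.
my @remaining = map { $_->as_text } $tree->look_down(_tag => 'p');
print "Remaining: @remaining\n";

$tree = $tree->delete;    # free the tree once we are done with it
```

Because the list of paragraphs is built before anything is removed, the 
counter sees every <p> exactly once, which is what the $p_counter idea 
in the original script needs.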

Any suggestions? Anything enlightening to read? I looked around for 
examples of using HTML::Element and didn't find much. I did find 
Randal S.'s Oct 98 Web Techniques article on using it very helpful. 
Is there anything more I could read?

Thanks!

Chris Cothrun
[EMAIL PROTECTED]