Chris Cothrun wrote:
> [...]
> contents from HTML pages. My ultimate goal is to strip certain <p> 
> while retaining others, however, the following code doesn't seem to 
> want to strip all the <p> nodes.
> [...]

At the end of the documentation for traverse(), there is this ominous
line:

"(Note: you should not change the structure of a tree while you are
traversing it.)"

...which is exactly what you're doing.

Logically, then, you should make all structural changes /after/ the
traversal is all done.  HTML::TreeBuilder's tighten_up method is an
example of how to do it: you save up a list of things that need
deleting (e.g., in an array called @to_delete), and then delete them.

(The messy bit with tighten_up is that since it deletes text segments, it
can't just store them someplace and call "$_->delete foreach
@to_delete" later.  Instead it notes them as child X of parent Y, and
then calls "splice(parent-Y-content, X, 1)".  Well, basically.)
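To make the "child X of parent Y" idea concrete, here is a minimal
sketch. It is not tighten_up's actual code, and it does not use
HTML::Element at all: the tree is a hypothetical stand-in built from
plain hashes (elements) and plain strings (text segments), and the
names note_blank_text and @to_splice are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed tree: elements are hashrefs with
# a 'content' arrayref; text segments are plain (non-reference) strings.
my $tree = {
  tag     => 'body',
  content => [
    "  ",
    { tag => 'p', content => [ "Hello", "   " ] },
    "\n",
    { tag => 'div', content => [ "   ", { tag => 'p', content => [ "Bye" ] } ] },
  ],
};

my @to_splice;   # each entry is [ $parent, $index ]

sub note_blank_text {
  my $parent  = $_[0];
  my $content = $parent->{content};
  for my $i (0 .. $#$content) {
    my $child = $content->[$i];
    if (ref $child) {
      note_blank_text($child);            # an element: recurse into it
    } elsif ($child !~ /\S/) {
      push @to_splice, [ $parent, $i ];   # whitespace-only text: note it
    }
  }
  return;
}

# Pass 1: only look, never change the structure.
note_blank_text($tree);

# Pass 2: splice, working backwards so that removing child X never
# shifts the index of a child still waiting to be removed from the
# same parent.
foreach my $entry (reverse @to_splice) {
  my ($parent, $i) = @$entry;
  splice @{ $parent->{content} }, $i, 1;
}
```

Going through the list in reverse is the design choice that matters
here: entries for any one parent were recorded in ascending index
order, so undoing them last-to-first keeps every saved index valid.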


About your traverser -- note that you don't need to visit nodes in
both pre- and post-order; and you don't need to visit text nodes; and
if you're going to delete a node, there's no reason to go visiting all
its children.  You could just say:

  my @to_delete;
  $tree->traverse(
    [
      sub { # pre-order traverser
        if($_[0]->tag() eq 'p') {
          push @to_delete, $_[0];
          return HTML::Element::PRUNE; # or just 0 will do
        }
        return HTML::Element::OK;
      },
      undef, # no post-order
    ],
    1 # ignore text nodes -- so our traverser sees only elements
  );
  $_->delete foreach @to_delete; # safe now: the traversal is over

A bit ugly, I know.

> Any suggestions? Anything enlightening to read? I did look around 
> some for some examples of using HTML::Element and didn't find 
> too much, I found Randal S.'s Oct 98 Web Techniques article on 
> using it very helpful, anything more I could read?

Hmm, I'm speaking about HTML::Element and HTML::TreeBuilder at YAPC in
June, and this discussion (and other recent questions) has been very
helpful to me in showing what confuses people about HTML::Element.  (I
am glad that the answer to that is no longer "absolutely everything".)

I may end up turning the results into an article for /The Perl
Journal/ -- maybe issue 19 (not the issue that JUST mailed, or the one
after, but the one after THAT).  At the very least, this should help
me improve the docs to Element.


Further ruminations:

I am beginning to think that the mere existence of the traverse()
method may be a mistake, partly because it has so damned many options
and calling syntaxes now (basically all my fault -- and more on the
way!!), and partly because it keeps people from writing their own
recursive routines, which may be simpler.  Consider the simplicity of:

  my @to_delete;
  sub visitor {
    my $node = $_[0];
    if($node->tag eq 'p') {
      push @to_delete, $node;
    } else {
      foreach my $child (grep ref($_), $node->content_list) {
        visitor($child); # recurse
      }
    }
    return;
  }

  visitor($tree);
  foreach my $n (@to_delete) { $n->delete() };
  @to_delete = ();

Or, a bit fancier and tidier, using a wrapper function:

 {
  my @to_delete;
   # visible to only these two routines: start_visiting and visitor

  sub start_visiting {
    @to_delete = ();
    visitor($_[0]);
    foreach my $n (splice @to_delete) { $n->delete() };
    # Neat trick -- just splice(@array) empties the array,
    #  returning its contents
    return;
  }

  sub visitor {
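    # same body as the visitor routine above
    my $node = $_[0];
    if($node->tag eq 'p') {
      push @to_delete, $node;
    } else {
      foreach my $child (grep ref($_), $node->content_list) {
        visitor($child); # recurse
      }
    }
    return;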
  }
 }

then just call start_visiting($tree).
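The splice trick used in start_visiting is easy to try on its own.
A small sketch, with a made-up array @queue standing in for
@to_delete:

```perl
use strict;
use warnings;

my @queue = ('a', 'b', 'c');

# splice with only the array argument removes every element and
# returns the removed elements as a list.
my @taken = splice @queue;

print "taken: @taken\n";
print "left:  ", scalar(@queue), "\n";
```

So "foreach my $n (splice @to_delete)" loops over a copy of the old
contents while @to_delete itself is already empty, and no separate
"@to_delete = ()" is needed afterwards.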



Or, if you want to get very fancy and very weird:

sub start_visiting {
  my @to_delete;
  my $visitor; 
  $visitor = sub {
    my $node = $_[0];
    if($node->tag eq 'p') {
      push @to_delete, $node;
    } else {
      foreach my $child (grep ref($_), $node->content_list) {
        $visitor->($child); # recurse
      }
    }
  }; # That's a /recursive/ anonymous subroutine!

  $visitor->($_[0]); # do it! (the tree is our first argument)

  undef $visitor;
   # Break the circularity -- this is necessary
   # for all routines whose frames contain variables
   # that hold references to the routine itself.
   # HURTS, DON'T IT!
  foreach my $n (@to_delete) { $n->delete() };
  return;
}


For extra credit, consider why the two lines "my $visitor" and
"$visitor = sub {" CANNOT be combined into "my $visitor = sub {".
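One way to explore the extra-credit question without touching a tree
at all: the sketch below swaps in a factorial routine (my choice of
example, not anything from HTML::Element) and compiles the combined
form inside a string eval, so that the failure can be captured rather
than halting the script. The variable names are made up for
illustration.

```perl
use strict;
use warnings;

# Combined form. A variable declared with "my" is not visible until
# the statement AFTER its declaration, so the $fact_combined inside
# the sub body is not the lexical being declared. Under "use strict"
# that is a compile-time error, which the string eval captures.
my $combined_ok = eval q{
  my $fact_combined = sub {
    my $n = $_[0];
    return $n <= 1 ? 1 : $n * $fact_combined->($n - 1);
  };
  1;
};
my $combined_error = $combined_ok ? '' : $@;

# Two-step form. $fact is already declared when the sub is compiled,
# so the sub closes over it and can call itself through it.
my $fact;
$fact = sub {
  my $n = $_[0];
  return $n <= 1 ? 1 : $n * $fact->($n - 1);
};
my $result = $fact->(5);
undef $fact;   # break the circularity, as in start_visiting above

print "combined form: ",
  ($combined_ok ? "compiled" : "failed to compile"), "\n";
print "two-step form: $result\n";
```

Without "use strict" the combined form would compile, but the inner
variable would quietly refer to a package variable of the same name,
which is arguably worse than the error.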

-- 
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
