At 12:44 PM 2003-03-02 -0500, Ed Halley wrote:
I've got a quick-n-dirty script almost working, which uses HTML::TreeBuilder, et al, to find plain text paragraphs. I hoped to get a bunch of text from a number of sources, so I can't get too finicky about each site's idiomatic use of HTML. However, the <P> tag is so loose in its semantics, it can be hard to see how I can get all the text I can.

Yup, I remember that problem from when I was doing Pod::HTML2Pod. It's nasty. I've tried writing general-purpose routines for inferring more implicit P elements, but it's very tricky. For example, consider parsing this:

  <blockquote>
  Foo
  <p>Bar
  <p>Baz
  </blockquote>

as if it were this:

  <blockquote>
  <p>Foo</p>
  <p>Bar</p>
  <p>Baz</p>
  </blockquote>

For some purposes and users, that's rightheaded and right. For other purposes and users, it's surprising and scary -- two things that make me cringe when I think about putting code into a module and holding it up as The Solution.

[...] Is there a good clean way of traversing to the "previous child",

You can check $element->left (in scalar context), which returns the node immediately to the left of $element under the same parent.
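
As a minimal sketch of that, assuming a made-up one-line sample document (the HTML string and the @report array are illustrative only), this asks each p for its nearest left sibling. The sibling may be another element or a plain text string, so the code checks ref before calling any method on it:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample input: one bare text node followed by two unclosed <p>s.
my $root = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');

my @report;
foreach my $p ($root->find_by_tag_name('p')) {
    # Scalar assignment puts left() in scalar context, so we get only
    # the nearest left sibling (or undef if $p is the first child).
    my $prev = $p->left;
    my $what = !defined $prev ? 'nothing'
             : ref $prev      ? 'element <' . $prev->tag . '>'
             :                  qq{text "$prev"};
    push @report, $p->as_text . ' is preceded by ' . $what;
}
print "$_\n" for @report;
```

In list context, left returns all the left siblings rather than just the nearest one, so the scalar assignment matters here.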

or tagging plain text ending with <P> as a bona-fide <P>text</P> span?

Well, to make text siblings of p's into p's themselves, you could always do something like this:


# Assumes $root is an already-parsed HTML::TreeBuilder tree.
my %parents; # a hash used as a set, keyed by stringified reference
foreach my $p ($root->find_by_tag_name('p')) {
  my $parent = $p->parent;
  $parents{$parent} = $parent;
}
foreach my $parent (values %parents) {
  foreach my $node (@{ $parent->content || next }) {
    # For each text node that has a p sister, replace that
    # node with a new paragraph containing it.
    next if ref $node; # elements are refs; text nodes are plain strings
    my $para = HTML::Element->new('p',
      '_parent' => $parent, '_content' => [$node]);
    $node = $para; # $node aliases the slot in the content array
  }
}

I'm just writing that code off the top of my head, and I'm not sure it'll work. Also, using ->content and assigning directly to _parent and _content like this is sort of a "don't try this at home, kids" thing, not for any technical reason, but just because the content_list (etc) interface is friendlier in many ways. But in this case, using ->content happens to be the easy way: it returns a reference to the actual content array, so if you iterate over it with a for, the loop variable is aliased directly to each array slot, and assigning to it replaces the node in place.
Plus I'm just in an old skool mood today.
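
For comparison, here is a rough sketch of the same transformation done through the friendlier interface, without touching _parent or _content. It is one possible approach, not a drop-in fix: the sample HTML string and the @summary array are made up for illustration, and it rebuilds each parent's child list with detach_content and push_content instead of assigning through an alias.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample input, matching the blockquote example above.
my $root = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');

# Collect every element that has at least one <p> child.
my %parents;    # a hash used as a set
foreach my $p ($root->find_by_tag_name('p')) {
    my $parent = $p->parent or next;
    $parents{$parent} = $parent;
}

foreach my $parent (values %parents) {
    # Take all the children out, then put them back one by one,
    # wrapping each bare text node in a new <p> on the way.
    my @kids = $parent->detach_content;
    foreach my $kid (@kids) {
        if (ref $kid) {
            $parent->push_content($kid);    # element: keep as it was
        }
        else {
            my $para = HTML::Element->new('p');
            $para->push_content($kid);      # text: wrap it
            $parent->push_content($para);
        }
    }
}

# Summarize the blockquote's children so the result is easy to eyeball.
my $bq = $root->find_by_tag_name('blockquote');
my @summary = map { ref $_ ? $_->tag . ':' . $_->as_text : "text:$_" }
    $bq->content_list;
print join(', ', @summary), "\n";
```

The trade-off is a little more shuffling of children in exchange for letting push_content keep the parent links consistent. Like the version above, this wraps every text sibling of a p, which is exactly the behavior that may surprise some users.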


So tell me, what kinda stuff are you doing with HTML::Tree? I'm always curious.

Here, I'll CC the libwww list, since sometimes there's not enough talk about HTML::Tree there.

--
Sean M. Burke    http://search.cpan.org/~sburke/


