I've got a quick-n-dirty script almost working, which uses HTML::TreeBuilder, et al, to find plain text paragraphs. I hoped to get a bunch of text from a number of sources, so I can't get too finicky about each site's idiomatic use of HTML. However, the <P> tag is so loose in its semantics, it can be hard to see how I can get all the text I can.
Yup, I remember that problem from when I was doing Pod::HTML2Pod. It's nasty. I've tried writing general-purpose routines for implicating more P elements, but it's very tricky. For example, consider parsing this:
<blockquote>
Foo
<p>Bar
<p>Baz
</blockquote>
as if it were this:
<blockquote>
<p>Foo</p>
<p>Bar</p>
<p>Baz</p>
</blockquote>
For some purposes and users, that's rightheaded and right. For other purposes and users, it's surprising and scary -- two things that make me cringe when I think about putting code into a module and holding it up as The Solution.
[...]Is there a good clean way of traversing to the "previous child",
You can check $element->left (in scalar context)
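Here's a minimal sketch of that, assuming the tree was built with HTML::TreeBuilder from the blockquote sample above (the @report array and the message strings are just names made up for illustration). In scalar context, ->left gives the immediate left sibling, which may be an element object, a plain text string, or undef if there's nothing to the left:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup borrowed from the blockquote example above.
my $root = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');

my @report;    # hypothetical: one description per <p> found
foreach my $p ($root->find_by_tag_name('p')) {
    # Scalar context: the immediate left sibling, or undef if none.
    my $prev = $p->left;
    push @report,
        !defined $prev ? 'no left sibling'
      : ref $prev      ? 'element <' . $prev->tag . '>'
      :                  "text '$prev'";
}
print "$_\n" for @report;

$root->delete;    # free the tree when done
```

Text nodes are plain strings, not objects, so a ref test is enough to tell the two kinds of sibling apart.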
or tagging plain text ending with <P> as a bona-fide <P>text</P> span?
Well, to make text siblings of p's into p's themselves, you could always do something like this:
my %parents; # a hash used as a set
foreach my $p ($root->find_by_tag_name('p')) {
  my $parent = $p->parent;
  $parents{$parent} = $parent;
}
foreach my $parent (values %parents) {
  foreach my $node (@{ $parent->content || next }) {
    # for each text node that has a p sister, replace that
    # node with a new paragraph containing it
    next if ref $node;
    my $para = HTML::Element->new('p',
      '_parent' => $parent, '_content' => [$node]);
    $node = $para;
  }
}
I'm just writing that code off the top of my head, and I'm not sure it'll work. Also, using ->content and direct assignments to _parent and _content like this is sort of a "don't try this at home, kids" thing, not for any technical reason, but just because the content_list (etc) interface is friendlier in many ways. But in this case, using ->content happens to be the easy way: if you iterate over it with a for, the for variable is directly aliased to the node, so assignment alters the node in-place.
Plus I'm just in an old skool mood today.
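For comparison, here's a rough sketch of the same idea done through the friendlier interface: content_list to read the children, push_content to fill the new p, and splice_content to swap it in by index. The sub name wrap_text_siblings_of_p is made up for illustration, and skipping whitespace-only text is my own assumption, not something the problem demands:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical helper: wrap each text node that has a <p> sibling
# in a new <p> of its own, using only the public interface.
sub wrap_text_siblings_of_p {
    my ($root) = @_;

    my %parents;    # a hash used as a set
    foreach my $p ($root->find_by_tag_name('p')) {
        my $parent = $p->parent or next;
        $parents{$parent} = $parent;
    }

    foreach my $parent (values %parents) {
        my @kids = $parent->content_list;
        for my $i (0 .. $#kids) {
            next if ref $kids[$i];             # elements stay as they are
            next unless $kids[$i] =~ /\S/;     # assumption: skip blank text
            my $para = HTML::Element->new('p');
            $para->push_content($kids[$i]);
            # One node out, one node in, so later indexes stay valid.
            $parent->splice_content($i, 1, $para);
        }
    }
    return $root;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');
wrap_text_siblings_of_p($tree);
print $tree->as_HTML, "\n";
```

It's more typing than the aliasing trick, but it never touches _parent or _content directly.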
So tell me, what kinda stuff are you doing with HTML::Tree ? I'm always curious.
Here, I'll CC the libwww list, since sometimes there's not enough talk about HTML::Tree there.
-- Sean M. Burke http://search.cpan.org/~sburke/
