At 08:49 AM 2000-12-04 +0100, Marek Rouchal DAT CAD HW Tel 25849 wrote:
>[...]
>In order to include raw HTML the user supplies with =for and =begin, I
>need to parse it with HTML::Treebuilder to turn it into nodes of type
>HTML::Element for inclusion in what Pod::HTML produces (namely a
>[...]
That reminds me of a more general question:
I generally say that TreeBuilder is for parsing only whole documents -- in
the same sense that a hammer is for banging on things. It's okay to try to
use TreeBuilder to parse document-fragments, the same as it's okay to try
to use a hammer as a forceps -- but in both cases it will take some
improvisation and cleverness on your part.
But one thing I think might be helpful is a method I wrote, and keep
meaning to put in the next version of Element:
  sub HTML::Element::highest_explicits {
    my(@stack) = ($_[0]);
    my @out;
    my $this;
    while(@stack) { # idiom for preorder traversal
      if(
        ref($this = shift @stack)
        and $this->{'_implicit'}
      ) {
        unshift @stack, @{$this->{'_content'} || next};
         # traverse it
      } else {
        push @out, $this; # and don't traverse under this
      }
    }
    return @out;
  }
When you say
  $treelet->eof(); # don't forget to do this!
  @docfrag = $treelet->highest_explicits;
you get the list of the highest non-implicit (=explicit) element nodes in
the tree. It is possible to get really odd results out of this, but only
with nonsensical input code, I think. (This might be followed by something
like: for(@docfrag) { $_->detach if ref($_) }; )
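To make the traversal concrete without needing a parser, here is a minimal sketch that runs the same kind of loop over hand-built hashes. The node layout (_tag, _implicit, _content) mimics the HTML::Element internals that the method above pokes at; the tree itself and the names node() and highest_explicits_of() are made up for illustration, and the tree is only roughly what I'd expect a parse of "<p>foo</p><ul><li>bar</li></ul>" to look like.

```perl
use strict;
use warnings;

# Hypothetical helper: build a plain hash shaped like an HTML::Element node.
sub node {
    my ($tag, $implicit, @content) = @_;
    return { _tag => $tag, _implicit => $implicit, _content => [@content] };
}

# Same traversal as the method above, written as a plain function.
# (Uses "|| []" where the method uses "|| next"; the effect is the same.)
sub highest_explicits_of {
    my @stack = ($_[0]);
    my @out;
    while (@stack) {
        my $this = shift @stack;
        if (ref($this) and $this->{'_implicit'}) {
            # implicit element: look at its children instead
            unshift @stack, @{ $this->{'_content'} || [] };
        } else {
            # explicit element (or bare text): keep it, don't go under it
            push @out, $this;
        }
    }
    return @out;
}

# Assumed shape of the tree: implicit html/head/body wrapped around
# the two explicit elements the user actually wrote.
my $tree = node('html', 1,
    node('head', 1),
    node('body', 1,
        node('p',  0, 'foo'),
        node('ul', 0, node('li', 0, 'bar')),
    ),
);

my @docfrag = highest_explicits_of($tree);
my @tags = map { ref($_) ? $_->{'_tag'} : "text:$_" } @docfrag;
print join(' ', @tags), "\n";   # prints "p ul"
```

Note that the li never shows up on its own: it sits under an explicit ul, and the loop doesn't descend past the first explicit node on any branch. Bare text sitting directly under an implicit element would come back as a plain string, which is why the detach loop checks ref($_).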
Everything is happier, BTW, if input code is zero or more self-contained
elements (as opposed to ending on an incomplete element anywhere in there).
If you needed to see /whether/ that's the case, see what the $treelet->pos
is. In theory, if it points to an explicit element, the source
code-fragment wasn't complete. However, this would scream in the case of:
  <p>foo
because the pos is still on the explicit p there. So consider something
where you forgive left-open elements whose end tags are normally omissible,
like maybe:
  require HTML::Tagset;
  my $pos = $treelet->pos;
  my @up_pos = ($pos, $pos->lineage);
  my $saw_incomplete;
  foreach my $e (@up_pos) {
    # forgive implicit elements and those with omissible end-tags
    ++$saw_incomplete unless $e->implicit
      or $HTML::Tagset::optionalEndTag{$e->tag};
  }
  die "GLEIVEN! GLAH $saw_incomplete!" if $saw_incomplete;
or maybe you could get away with just:
  require HTML::Tagset;
  my $pos = $treelet->pos;
  my @up_pos = ($pos, $pos->lineage);
  my $saw_incomplete;
  foreach my $e (@up_pos) {
    last if $e->implicit; # stop at the first implicit element going up
    ++$saw_incomplete unless $HTML::Tagset::optionalEndTag{$e->tag};
  }
  die "GLEIVEN! GLAH $saw_incomplete!" if $saw_incomplete;
I'm not sure there'd be a practical difference, assuming sane code.
But I'm not sure how it'd behave with mildly strange code, like any of:
  <td><li>hoohah!</li>
  <td><li>hoohah!</li></td>
  <li><td>hoohah!</td>
  <li><td>hoohah!</td></li>
But having to assume sane input is not too much of a problem -- assuming
that everyone agrees on what sanity is, and that it /is/ tested on some
examples of reasonably sane code.
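
Going back to the first of the two checks above: to see how it treats the "<p>foo" case versus a genuinely unclosed element, here is a minimal sketch using stand-in objects rather than a real parse. Local::Node, looks_incomplete(), and the hard-coded %optional_end set are all hypothetical. The set stands in for %HTML::Tagset::optionalEndTag, and listing just p, li, dt and dd in it is an assumption on my part, not something to rely on.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element: just a tag, an implicit flag,
# and a parent pointer.
package Local::Node;
sub new {
    my ($class, $tag, $implicit, $parent) = @_;
    return bless { tag => $tag, implicit => $implicit, parent => $parent }, $class;
}
sub tag      { $_[0]{tag} }
sub implicit { $_[0]{implicit} }
sub lineage {                       # ancestors, nearest first
    my @up;
    my $n = $_[0]{parent};
    while ($n) { push @up, $n; $n = $n->{parent} }
    return @up;
}

package main;

# Assumed stand-in for %HTML::Tagset::optionalEndTag.
my %optional_end = map { $_ => 1 } qw(p li dt dd);

# Count explicit elements at or above $pos whose end-tag is not omissible.
sub looks_incomplete {
    my ($pos) = @_;
    my $count = 0;
    foreach my $e ($pos, $pos->lineage) {
        ++$count unless $e->implicit or $optional_end{ $e->tag };
    }
    return $count;
}

my $html = Local::Node->new('html', 1);
my $body = Local::Node->new('body', 1, $html);
my $p    = Local::Node->new('p',    0, $body);  # pos as if input were "<p>foo"
my $bold = Local::Node->new('b',    0, $p);     # pos as if input were "<p><b>foo"

print looks_incomplete($p),    "\n";   # prints 0: the open p is forgiven
print looks_incomplete($bold), "\n";   # prints 1: the open b is not
```

With a real tree you would pass $treelet->pos instead of these hand-made nodes, and the interesting part is then whether the parser leaves pos where you expect for the stranger inputs.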
--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/