Marvin Simkin wrote:
> Not to criticize someone's hard work (OK, I guess I am...) but aren't we
> barking up the wrong tree here? Shouldn't it be possible to discover
> everything we need to know about HTML by parsing a DTD? That way as the
> language changes, software would automatically recognize new tags. :)
I spent much of the past year barking up that and other trees, and
I came to the opinion I summarize in the current HTML::TreeBuilder docs:
[startquote]
HTML AND ITS DISCONTENTS
HTML is rather harder to parse than people who write it generally
suspect.
Here's the problem: HTML is a kind of SGML that permits "minimization"
and "implication". In short, this means that you don't have to close
every tag you open (because the opening of a subsequent tag may
implicitly close it), and if you use a tag that can't occur in the
context you seem to be using it in, under certain conditions the parser
will be able to realize you mean to leave the current context and
enter a new one -- that being the only context in which your code
could be correctly interpreted.
Now, this would all work flawlessly and unproblematically if: 1) all
the rules that both prescribe and describe HTML were (and had been)
clearly set out, and 2) everyone was aware of these rules and wrote
their code in compliance with them.
However, it didn't happen that way, and so most HTML pages are
difficult if not impossible to correctly parse with nearly any set of
straightforward SGML rules. That's why the internals of
HTML::TreeBuilder consist of lots and lots of special cases -- instead
of being just a generic SGML parser with HTML DTD rules plugged in.
[endquote]
(First off, apologies if I misuse the SGML jargon terms "minimization"
and "implication".)
I considered basing a totally new HTML::TreeBuilder entirely on
information extracted from a DTD, namely content models. A lot of
information can be gleaned from CMs. If you have a CM for TABLE that
says "TR+", and a CM for TR that says "TD+", and there's no other CM
that mentions TR or TD, then having a naked "<TD>" in your SGML should
implicate a containing TABLE and TR. So that gives you the whole
concept of "required parents", if you need it. (Now, I don't know
what SGML should/may/must do when it comes across a naked "<LI>" in
code, when there are three elements with the CM of "LI+". Maybe look
for the one that has the minimization/implication bits turned on.)
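(You can watch that kind of parent-implication happen in the current
TreeBuilder; a sketch, and the exact dump output may vary by version:)

  use HTML::TreeBuilder;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse("<td>lonely cell");  # a naked TD, no TABLE or TR in sight
  $tree->eof;
  $tree->dump;    # should show an implicit table and tr around the td
  $tree->delete;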
So, I love abstraction and hate hard-coded special cases as much as
anyone else. But I abandoned the above approach for several
reasons:
* The HTML DTDs are not growing at an alarming rate or in baroque
ways. What things ARE occasionally added to the HTML DTDs are NOT
tags with complicated minimization/implication behavior. They are
tidy things like <Q>...</Q>, not whole new families of messy
implication- and minimization-rich things like TABLE, TR, TD, or, God
help us, P.
* Adding new tags to data tables in Known/TreeBuilder/whatever is not
hard or time-consuming. And given the previous point, it's not
something that will happen often. Moreover, the current DTDs for HTML
are (appropriately, one might say) simply not meant for use in robust
parsing. In linguistic terms, they are a production grammar, not a
robust receptive grammar. Taking W3-released HTML DTDs and turning
them into robust ("sloppy") DTDs is more time-consuming than just
maintaining the Known/TreeBuilder/whatever tables, and involves an
additional layer of abstraction that is not clearly beneficial.
* There are basically two ways to parse HTML: DTDlessly (i.e., with
special cases hardcoded in the parser), like the current TreeBuilder;
or entirely based on a DTD. If you want the latter, just use an SGML
parser. If you want the former, i.e., if you want something /like/
TreeBuilder, use TreeBuilder. I, personally, see no need for anything
else /like/ TreeBuilder /besides/ TreeBuilder. But I invite all
suggestions, if anyone can think of a good reason why there should be
more than one robust HTML syntax-tree-maker module in CPAN.
* Abstraction always has a price. Here the price of moving to clean
DTD-based parsing is programmer sanity, clarity of code in
TreeBuilder, and program efficiency and robustness. The current
TreeBuilder is full of special cases because they're fast, they're
pretty robust, and they're easier to write than something that has to
ponder all tags' CMs before it can know what to do with a naked
<LI>. And pondering SGML CMs can be a very nontrivial operation, if
one assumes that any kind of SGML-valid CM could appear in an HTML DTD.
(And there's no reason to suppose otherwise.)
* Abstraction would have clear benefits if it gave TreeBuilder the
new ability to /robustly/ parse /any/ kind of messy implication- and
minimization-rich SGML. However, HTML is the
only SGML instance with this level of rampant malformedness that I or
anyone I know has ever had to deal with.
And robustness here is extremely important -- TreeBuilder is for
making the best of what it's given, and turning it into a syntax
tree that's no worse than the bad code it's given. While
I'm not saying that it's /provably impossible/ to ROBUSTLY parse HTML
with an SGML parser plus an HTML DTD, it is my experience that as I
was trying to tune TreeBuilder, I saw many cases where doing the thing
that's right by the DTD's CMs would have had very messy results.
Consider, for example, this idiom for adding interlinear space between
one's list-items:
<ol>
<li><p>anise
<li><p>lentils
</ol>
Now, the expected parse of this is:
ol
  li
    p
      "anise"
  li
    p
      "lentils"
And that's what TreeBuilder was doing, and now does.
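(Same recipe as above, if you want to check that parse for yourself:)

  use HTML::TreeBuilder;
  my $tree = HTML::TreeBuilder->new;
  $tree->parse("<ol>\n<li><p>anise\n<li><p>lentils\n</ol>");
  $tree->eof;
  $tree->dump;    # expect an ol holding two li's, each holding a p
  $tree->delete;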
However, consider this minutely different document:
<ol>
<p><li>anise
<p><li>lentils
</ol>
At one point in tuning TreeBuilder, this parsed as, if I recall:
ol
p
  ul (implicit)
    li
      "anise"
      p
    li
      "lentils"
Why? Because a P can't be a daughter of an OL, so it must be
minimizing the OL. Oh, but then there's an apparently naked LI --
which, faute de mieux, implicates a containing UL. This is quite
straightforwardly correct as far as the DTD goes, but it turns one
tiny point of disagreement between the DTD and the programmer's idea
of the language into a parse that's a great departure from what's
intended.
Actually, it's still invalid SGML: OL's CM is "LI+", not "LI*", so an
OL (as above) with zero LIs in it is wrong -- so what to do with it?
Completely delete it from the parse tree?
(And don't ask me how you'd parametrize, in a DTD, the idea of "yes, P
isn't in UL's content model, but LI is, and LI can contain P, so
implicate a containing LI for that P"!)
What I ended up doing in this particular case is removing anything
like an enforcement of UL/OL/DL CMs, so the above parses as:
ol
  p
  li
    "anise"
    p
  li
    "lentils"
This may or may not make sense for an application that wants to render
that tree -- but then, that's not my problem, as there are LOTS of kinds
of DTD-correct HTML that will make just about any renderer sweat blood,
such as:
<a name='foo'><input type='hidden' name='x' value='2'></a>
<a href='yorp'><hr></a>
So, IN SHORT: I don't see clear benefits from going out of our way to
put information in HTML::Known that would be useful only to
TreeBuilder -- and only to TreeBuilder /only if/ it were rewritten to
have little or no special-casing in its code, and use only information
that comes from a DTD. TreeBuilder is, honest to God, NOT an SGML
parser; it's a passably robust (if credulous) parser hand-hacked for a
uniquely twisted language, HTML, where the typical document is
malformed.
I /am/ considering putting CMs in HTML::Known, but only such CMs as
come straight from the DTD, and which therefore would /not/ be used by
TreeBuilder, but could be used in, say, HTML::AsSubs, for
runtime-checking of the tree the user is in the middle of elaborating.
However, this currently has the status of merely a pet project, unless
someone can think of any other use those CMs would have.
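(To sketch what I mean -- and I stress that none of this exists yet;
%CM and daughters_ok are names made up for illustration -- the check
could be as simple as matching an element's daughter-list against a
compiled CM:)

  # Hypothetical sketch only: nothing here is in HTML::Known today.
  # Suppose CMs were stored as regexps over a daughter-list in which
  # each element-daughter is encoded as "<tagname>":
  my %CM = (
      ol => qr/^(?:<li>)+$/,   # from the DTD's "LI+"
      ul => qr/^(?:<li>)+$/,
  );

  sub daughters_ok {           # hypothetical checker
      my $element = shift;
      my $cm = $CM{ $element->tag } or return 1;  # no CM known: assume OK
      my $sequence = join '',
          map  { '<' . $_->tag . '>' }
          grep { ref $_ }                # skip text nodes (plain strings)
          @{ $element->content || [] };
      return $sequence =~ $cm;
  }

With HTML::AsSubs, using it mid-elaboration would look like:

  use HTML::AsSubs;
  my $list = ol( li("anise"), li("lentils") );
  print daughters_ok($list) ? "ok\n" : "bad\n";   # prints "ok"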
(The other consideration is whether I should accept only XML-style
CMs, as from the XHTML CMs, or all possible SGML CMs. XML-style CMs
always translate trivially into regexps, but SGML-style CMs contain
features that don't, like "(foo+ & bar+ & zop), baz" -- so one would
have to perform a real parse on such things, and compile them into a
finite-state automaton to check them against a proposed
daughter-list... which brings up the whole unpleasant issue of
whether to use an NFA for these things, even tho, I believe, SGML CMs
must always be deterministic.)
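(For the easy half of that, here's the kind of trivial translation I
mean -- my own sketch, handling only the XML-style subset of names,
sequence, alternation, grouping, and the ?/*/+ suffixes, with no
attempt at #PCDATA, EMPTY, or SGML's "&" connector:)

  # Sketch: mechanically compile an XML-style CM into a Perl regexp
  # that matches a daughter-list encoded as a string of "<tag>" tokens.
  sub cm_to_regexp {
      my $cm = shift;
      $cm =~ s/\s+//g;                        # whitespace is insignificant
      $cm =~ s/([a-zA-Z][\w.-]*)/(?:<$1>)/g;  # each name matches one token
      $cm =~ s/,//g;                          # sequence is just concatenation
      return qr/^$cm$/;                       # |, (), ?, *, + pass through
  }

  my $ol_cm = cm_to_regexp('(li)+');          # OL's CM, XML-style
  print "ok\n"  if '<li><li>' =~ $ol_cm;      # two LIs: matches
  print "bad\n" unless '<p><li>' =~ $ol_cm;   # a stray P: fails

An SGML AND-group like "(foo+ & bar+ & zop)" ("all of these, in any
order") has no such one-line translation -- which is exactly where the
real parse and the automaton would come in.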
--
Sean M. Burke [EMAIL PROTECTED] http://www.netadventure.net/~sburke/