Gisle Aas wrote:
> I have now looked through the new code and must simply say that I am
> impressed.  It looks like it will actually work now  :-)

Why, thank you.  I'd been meaning to poke at it for years now; and in
doing so (in /meaning/ to poke at it, that is), I ended up getting
side-tracked in many strange ways.  E.g., Tree::DAG_Node grew out of
me wanting to do more ornate tree-ey things than Element made
particularly easy.

Thanks, by the way, to Marek Rouchal, for encouraging me to add things
to Element that would do some of the same kinds of things that
DAG_Node does -- like unshift_content, splice_content, etc; and so
that when you detach a node, the detacher method kills the parent's
link to the child as well as the child's link to the parent; and that
when you attach a node, it detaches it from any node it might already
be under, and so forth.
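For concreteness, here's a rough sketch of that bookkeeping (in
Python, for brevity -- the class and method names are mine, not the
actual HTML::Element API):

```python
class Node:
    """Toy tree node illustrating two-way parent/child bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def detach(self):
        # Kill the parent's link to the child AND the child's link
        # to the parent, so neither side is left dangling.
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self

    def attach(self, child):
        # Attaching first detaches the node from wherever it already
        # is, so a node can never end up under two parents at once.
        child.detach()
        child.parent = self
        self.children.append(child)

root = Node('html')
body = Node('body')
root.attach(body)
other = Node('div')
other.attach(body)   # body silently moves out from under root
assert body.parent is other
assert body not in root.children
```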

> * We should really have a common module (perhaps HTML::DTDdata)
>   that just contain the information that can be extracted about
>   HTML elements/attributes from HTML DTDs. For instance
>   %linkElements should not have to be maintained both in
>   HTML::LinkExtor and HTML::Element. [...]

I was thinking just the same thing when I was modding Element and
TreeBuilder.

Minor digression:  one day a few weeks back, after I'd boggled over the
TreeBuilder code (before I gave in and put in all the print "I'm doing
[whatever]!\n" if $Debug > 1 hooks you see in the code, whereupon
everything made sense), I thought, "Well, I might as well just look at
the Mozilla source and see how their parser works, and make one in
Perl that works the same way."  So I went and DL'd the Mozilla source.
I discovered that the ten THOUSAND files in the source dist
collectively comprise not so much a program as a portal to Hell.

But for future reference of anyone else on this list, I'll summarize
my results:

* mozilla-19990128/htmlparser/tests/html/ contains lots of examples of
BAD HTML that proved useful in testing TreeBuilder.  (Much of it was
of the sort, "yup, that's BAD code, that's why it don't parse!", but
some was helpful.)

* The files
    mozilla-19990128/htmlparser/src/nsElementTable.cpp
    mozilla-19990128/htmlparser/src/nsHTMLTags.cpp
lay out NS's internal table of what elements exist, and what
structural restrictions there are on them.  (The latter is expressed
in a rather un-DTD-like way, by the way.)


But back to the idea of a Perl table containing DTD-like
information...
I fiddled a bit with parsing the XHTML DTD, and it was a bit
frustrating.  But what I was after was the content-models; maybe those
were the most problematic part.
I was working on a CM-to-regexp translator, which I think I have
working right, by the way -- for XML content models, that is -- SGML
CMs permit the & operator, which has no straightforward equivalent in
regexp.  This was so HTML::AsSubs could do runtime checking of content
models.  (A feature I've not yet gotten around to adding.)
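To illustrate the idea, here's a minimal sketch (in Python, for
brevity) of such a translator.  The operator handling is simplified,
and representing a node's children as a comma-joined string of tag
names is purely my own assumption, not how any of the modules
actually do it:

```python
import re

def cm_to_regexp(model):
    # Translate an XML content model into a regexp matched against a
    # child-tag string like "head,body,".  Covers the XML operators
    # , | ( ) * + ? -- SGML's '&' ("all of these, in any order") is
    # deliberately unsupported, having no simple regexp equivalent.
    out, i = [], 0
    while i < len(model):
        c = model[i]
        if c.isalnum() or c in '._-':
            j = i
            while j < len(model) and (model[j].isalnum() or model[j] in '._-'):
                j += 1
            out.append('(?:%s,)' % re.escape(model[i:j]))  # one child tag
            i = j
        else:
            if c == '(':
                out.append('(?:')   # non-capturing group
            elif c in ')|*+?':
                out.append(c)       # same meaning in a regexp
            # ',' (sequence) and whitespace need no output at all
            i += 1
    return re.compile('^%s$' % ''.join(out))

seq = cm_to_regexp('(head, body)')
assert seq.match('head,body,')
assert not seq.match('body,head,')
rep = cm_to_regexp('(dt | dd)+')
assert rep.match('dt,dd,dd,')
assert not rep.match('')
```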

But anyway: I was thinking about dealing with the DTD, but then
came, frustrated, to the opinion that parsing the DTD was more trouble
than it's worth (i.e., it took more time to write the DTD-parser program
than it would take to manually copy relevant bits from a DTD), and I
didn't see a massive advantage to having it be automatable.  Can you
think of any?

Moreover, it occurred to me that some of the tables I wanted to move
out of TreeBuilder and Element contained information that would not be
found in a DTD in any straightforward way.

For example, attributes that represent links may be declared as type
"%URI;" (that being just a mnemonic entity for CDATA), but to get
that, you have to be using a DTD that doesn't just say <!ATTLIST base
href CDATA>.  And the table called %canTighten that I use for deleting
ignorable whitespace isn't something that can be pulled out of a DTD
in any straightforward manner I can see.

So while I'd hardly want to cite Mozilla as an example of the right
way to do anything at all, it could be that what's needed is a module
that, like Mozilla's, sets all kinds of flags and tables for particular
entries, somewhat independent of a DTD.  (Altho it might be a good
idea to start with one of those DTD utilities of yours to at least
report what new tags are in a given DTD that aren't in the
HTML::DTDdata (or HTML::TagSet, or HTML::Lexicon, or HTML::Parameters,
or HTML::Known, to recall some names I was playing with).)

I figured that, to keep it simple, the module could consist of just
iterating over a long list of hashrefs like:
{  name => 'br',
   content => 'EMPTY',
   can_tighten => 1,
   is_phrasal => 1,
}
and setting entries in the appropriate hashes as necessary.
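Something like this, that is (sketched in Python rather than Perl for
brevity -- the real thing would be a list of hashrefs feeding plain
hashes, and the entries for p and img below are made up purely for
illustration):

```python
# One record per element; iterate once and fan the fields out into
# the per-property lookup tables that the other modules consult.
TAG_INFO = [
    {'name': 'br',  'content': 'EMPTY',  'can_tighten': True,  'is_phrasal': True},
    {'name': 'p',   'content': 'Inline', 'can_tighten': False, 'is_phrasal': False},
    {'name': 'img', 'content': 'EMPTY',  'can_tighten': True,  'is_phrasal': True},
]

is_empty, can_tighten, is_phrasal = {}, {}, {}
for rec in TAG_INFO:
    tag = rec['name']
    if rec['content'] == 'EMPTY':
        is_empty[tag] = True        # no content model to check at all
    if rec['can_tighten']:
        can_tighten[tag] = True     # whitespace around it is ignorable
    if rec['is_phrasal']:
        is_phrasal[tag] = True

assert is_empty.get('br') and not is_empty.get('p')
assert can_tighten.get('img')
```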

> * Can't the $verbose_for_text argument to HTML::Element->traverse
>   just be eliminated and assumed to be TRUE always.  Adding arguments
>   should not (normally) break anything.

Hm, I think you're right.  I couldn't think of a situation where it
could break anything, but was unsure.  Yes, I might as well remove the
option.

> * If find_by_tag_name/find_by_attribute was defined to return the
>   first element found in scalar context, then they could be
>   modified to stop searching as soon as an element is found.
>   Currently they will return the number of elements found
>   in scalar context I think.
> 
> * 'attr_get_i' is a strange name I think.  What does the "_i" mean?

I now realize this is a bit of a solecism of mine, from
Class::Classless -- "i" for "with inheritance", and the "get" is to
make it clear that unlike attr, parent, etc., this method can't be used
to optionally set the value in question.
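In other words, something to this effect (a rough Python sketch; the
El class here is just a stand-in for an element, not the real
HTML::Element):

```python
# Minimal stand-in for an element: a dict of attributes plus a parent link.
class El:
    def __init__(self, attrs, parent=None):
        self.attrs, self.parent = attrs, parent

def attr_get_i(node, attr):
    # "get": unlike attr(), this can only read, never set, the value.
    # "_i": look the attribute up *with inheritance*, walking up
    # through the ancestors until some element defines it.
    while node is not None:
        if attr in node.attrs:
            return node.attrs[attr]
        node = node.parent
    return None

html = El({'lang': 'en'})
td = El({}, parent=El({}, parent=html))
assert attr_get_i(td, 'lang') == 'en'      # inherited from the root
assert attr_get_i(td, 'align') is None     # defined nowhere
```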

> * I think we still have a memory leak.  I need to investigate.

That's quite puzzling.  Maybe someone who has Devel::Peek can
troubleshoot?

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.netadventure.net/~sburke/
