pondering HTML::Element

Sean M. Burke Thu, 6 Jan 2000 23:05:12 -0800
When I was tossing pages at HTML::TreeBuilder to see what it did, I
observed a few things that started me on a line of thought:

It's quite common for an HTML document to end up as a tree containing
several hundred HTML::Element objects.  (Often over 400, rarely over
1,500, and almost never over 9,000.)  That's HTML::Element objects, not
nodes in the parse tree -- since some leaf nodes may be just "text
segments", which are just strings in some HTML::Element node's
child-list, not objects.

I also note that the typical (i.e., average) HTML element has /no/
attributes (no foo="bar" pairs), and that having more than five
attributes is almost unheard-of.

So that means that the typical object looks like simply this:

  {
   '_mother' => ...an_object...,
   '_content' => [ ...objects and/or segments... ],
   '_tag' => 'blockquote'  # or whatever tag name
  }

And these things have been making me worry a bit about whether it
might not be better to store these element-objects as blessed arrays,
so as to consume less memory.  I can picture several ways to do this,
but first off I'd better ask:

* Does anyone know /how/ big of a memory win this'd be?  I.e., what
the difference in memory consumption is between a thousand objects as
above, versus, say, a thousand objects consisting of just this:
  [ 'blockquote', mother_object, 0, ...content_objects...]

* Is there a desire on the part of users of HTML::Element (whether
directly or via HTML::TreeBuilder) to have it be more
memory-efficient?

* Does anyone write applications using HTML::Element that break
encapsulation on HTML::Element objects?  That is, by accessing object
contents directly (like $node->{"id"}) instead of using accessors,
like $node->attr("id")?

* Does anyone have any applications that actually /move/ nodes around
in a tree of HTML::Element objects?  As opposed to simply taking the
structure HTML::TreeBuilder gives you, and traversing it, but never
changing it?

* Does anyone do /anything/ with HTML::Element trees, aside from
traversing the tree and reading attributes off of nodes?  If so, do tell.

* In short, what do you all use TreeBuilder for?
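
For the first question in the list above (how big a memory win this
would be), one way to get a rough number is to build a thousand nodes
in each layout and measure them.  The following is only a sketch under
assumptions: the helper names are hypothetical, the third array slot is
copied as-is from the layout above without assigning it a meaning, the
nodes here have no mother links or content, and the measurement relies
on the CPAN module Devel::Size, which may not be installed.

```perl
use strict;
use warnings;

# Hypothetical helper: N attribute-less nodes in the hash layout shown above.
sub make_hash_nodes {
    my ($n) = @_;
    my @nodes;
    push @nodes, { _tag => 'blockquote', _mother => undef, _content => [] }
        for 1 .. $n;
    return \@nodes;
}

# Hypothetical helper: N nodes in the proposed array layout.
# The third slot (0) is copied from the sketch above as-is.
sub make_array_nodes {
    my ($n) = @_;
    my @nodes;
    push @nodes, [ 'blockquote', undef, 0 ] for 1 .. $n;
    return \@nodes;
}

# Returns (hash_bytes, array_bytes), or an empty list if Devel::Size
# (a CPAN module, not part of core Perl) is unavailable.
sub compare_sizes {
    my ($n) = @_;
    return unless eval { require Devel::Size; 1 };
    return (
        Devel::Size::total_size(make_hash_nodes($n)),
        Devel::Size::total_size(make_array_nodes($n)),
    );
}

if (my ($hash_bytes, $array_bytes) = compare_sizes(1000)) {
    printf "hash-based:  %d bytes\narray-based: %d bytes\n",
        $hash_bytes, $array_bytes;
}
else {
    print "Devel::Size not installed; nothing measured\n";
}
```

The numbers depend on the Perl version and build, so they are only a
guide to the relative difference, not a fixed figure.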


Incidentally, what I'm thinking of could take two forms:

1) Modifying HTML::Element to store objects as blessed arrayrefs,
instead of as hashes.  Obviously, this will break anything that
currently breaks encapsulation on HTML::Element objects.

Or:

2) Not changing HTML::Element at all, but making a new class for nodes
that provides at least a subset of HTML::Element's interface.  I'd
then provide a way to turn trees made of HTML::Element objects into
trees made of objects of this more memory-efficient class.  (Access
wouldn't be significantly slower, incidentally, assuming
elements wouldn't have excessive numbers of attributes each.)  Making
this class would be a bit simpler if I provided only a subset of
HTML::Element's interface, esp. if it were read-only (i.e., no moving
or deleting nodes or adding or changing attribute values).
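
To make option 2 concrete, here is a minimal sketch of a read-only,
array-backed node class, plus a converter from a tree of plain hashes in
the layout shown earlier.  Everything here is an assumption for
illustration: the class name Sketch::CompactNode, the constructor
from_hash_tree, and the slot layout (tag, mother, number of attribute
slots, inline name/value pairs, then content) are not taken from
HTML::Element.  Only the accessor names tag, attr, parent, and
content_list are borrowed from its interface.

```perl
use strict;
use warnings;

package Sketch::CompactNode;    # hypothetical class name

# Assumed slot layout:
#   0: tag name   1: mother node   2: number of attribute slots (N)
#   3 .. 2+N: attribute name/value pairs   then: the content list
sub from_hash_tree {
    my ($class, $hash_node, $mother) = @_;
    my @attrs;
    for my $name (sort grep { !/^_/ } keys %$hash_node) {
        push @attrs, $name, $hash_node->{$name};
    }
    my $self = bless [ $hash_node->{_tag}, $mother, scalar(@attrs), @attrs ],
        $class;
    for my $child (@{ $hash_node->{_content} || [] }) {
        # Text segments stay plain strings; element children are converted.
        push @$self,
            ref($child) ? $class->from_hash_tree($child, $self) : $child;
    }
    return $self;
}

sub tag    { $_[0][0] }
sub parent { $_[0][1] }    # strong reference, so the tree is still circular

# Linear scan over the inline pairs; cheap while attributes are few.
sub attr {
    my ($self, $name) = @_;
    for (my $i = 3; $i < 3 + $self->[2]; $i += 2) {
        return $self->[ $i + 1 ] if $self->[$i] eq $name;
    }
    return;
}

# Children and text segments, in order (call in list context).
sub content_list {
    my ($self) = @_;
    my $first = 3 + $self->[2];
    return @$self[ $first .. $#$self ];
}

package main;

my $tree = {
    _tag     => 'blockquote',
    _mother  => undef,
    id       => 'q1',
    _content => [ 'Some text, ', { _tag => 'em', _content => ['emphasized'] } ],
};
$tree->{_content}[1]{_mother} = $tree;

my $compact = Sketch::CompactNode->from_hash_tree($tree);
my ($text, $em) = $compact->content_list;
printf "%s id=%s, child <%s> under <%s>\n",
    $compact->tag, $compact->attr('id'), $em->tag, $em->parent->tag;
```

Code that goes through the accessors keeps working with such nodes,
while code that breaks encapsulation, like $node->{"id"}, dies with a
"Not a HASH reference" error.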

Another sub-possibility for case 2, by the way, is to have
HTML::TreeBuilder generate trees out of nodes not of class
HTML::Element, but instead of the more memory-efficient class.  That
way, you'd not even temporarily use much memory.  (Assuming that
having an array instead of a hash-and-an-array is a big win; otherwise
there's no point.)

An elaboration on case 2 is to have a wrapper class around nodes, a
class encapsulating whole trees.  The win here is that the
tree-wrapper (which you could basically only traverse, or that plus
some operations based on traversing) wouldn't have links back to
itself, so you wouldn't have to explicitly call $tree->delete on it;
when it'd pass out of scope, it'd garbage-collect automatically, for
whatever that's worth to anyone.
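
The tree-wrapper idea can be sketched as follows.  This is a
hypothetical illustration, not an existing interface: Sketch::Node and
Sketch::Tree are made-up names, nodes hold only a tag and their content
(no link to the mother), and the wrapper's traversal tracks each node's
mother itself.  The DESTROY counter is there only to show that the
nodes are freed once the wrapper goes out of scope.

```perl
use strict;
use warnings;

package Sketch::Node;    # hypothetical: a node with no link to its mother

my $live = 0;            # number of nodes not yet destroyed

sub new {
    my ($class, $tag, @content) = @_;
    $live++;
    return bless [ $tag, @content ], $class;
}
sub DESTROY    { $live-- }
sub live_count { $live }

package Sketch::Tree;    # hypothetical wrapper around a whole tree

sub new {
    my ($class, $root) = @_;
    return bless { root => $root }, $class;
}

# Depth-first traversal.  The callback receives each node (or text
# segment) and its mother, which is tracked here, not stored in the node.
sub traverse {
    my ($self, $callback) = @_;
    my @todo = ([ $self->{root}, undef ]);
    while (my $item = shift @todo) {
        my ($node, $mother) = @$item;
        $callback->($node, $mother);
        next unless ref($node);
        my (undef, @content) = @$node;
        unshift @todo, map { [ $_, $node ] } @content;
    }
    return;
}

package main;

{
    my $tree = Sketch::Tree->new(
        Sketch::Node->new(
            'blockquote', 'Some text, ',
            Sketch::Node->new('em', 'emphasized'),
        ),
    );
    $tree->traverse(sub {
        my ($node, $mother) = @_;
        printf "%s (mother: %s)\n",
            ref($node) ? "<$node->[0]>" : "'$node'",
            $mother    ? "<$mother->[0]>" : 'none';
    });
    print "live nodes inside scope: ", Sketch::Node::live_count(), "\n";
}
print "live nodes after scope: ", Sketch::Node::live_count(), "\n";
```

Because no node refers back up the tree, there are no reference cycles,
and Perl's reference counting frees everything when the wrapper is
dropped.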

Thoughts, anyone?  Comments?

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.netadventure.net/~sburke/