When I was tossing pages at HTML::TreeBuilder to see what it did, I
observed a few things that started me on a line of thought:
It's quite common for an HTML document to end up as a tree containing
several hundred HTML::Element objects. (Often over 400, rarely over
1,500, and almost never over 9,000.) That's HTML::Element objects, not
nodes in the parse tree -- since some leaf nodes may be just "text
segments", which are just strings in some HTML::Element node's
child-list, not objects.
I also note that the typical (i.e., average) HTML element has /no/
attributes (no foo="bar" pairs), and that having more than five
attributes is almost unheard-of.
So that means that the typical object looks like simply this:
{
'_mother' => ...an_object...,
'_content' => [ ...objects and/or segments... ],
'_tag' => 'blockquote' # or whatever tag name
}
And these things have been making me worry a bit about whether it
might not be better to store these element-objects as blessed arrays,
so as to consume less memory. I can picture several ways to do this, but
first off I'd better ask:
* Does anyone know /how/ big of a memory win this'd be? I.e., what
the difference in memory consumption is between a thousand objects as
above, versus, say, a thousand objects consisting of just this:
[ 'blockquote', mother_object, 0, ...content_objects...]
* Is there a desire on the part of users of HTML::Element (whether
directly or via HTML::TreeBuilder) to have it be more
memory-efficient?
* Does anyone write applications using HTML::Element that break
encapsulation on HTML::Element objects? That is, by accessing object
contents directly (like $node->{"id"}) instead of using accessors,
like $node->attr("id")?
* Does anyone have any applications that actually /move/ nodes around
in a tree of HTML::Element objects? As opposed to simply taking the
structure HTML::TreeBuilder gives you, and traversing it, but never
changing it?
* Does anyone do /anything/ with HTML::Element trees, aside from
traversing the tree and reading attributes off of nodes? If so, do tell.
* In short, what do you all use TreeBuilder for?
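One rough way to put a number on the first question above is to build both layouts and measure them. The sketch below makes several assumptions: it uses the CPAN module Devel::Size if it happens to be installed (and just says so if not), it reads the 0 in the array layout as "no attributes", and the helper names build_hash_tree and build_array_tree are made up for illustration. Actual figures will vary with the perl version and build.

```perl
use strict;
use warnings;

# Build $n hash-based nodes under one root, using the hash layout shown above.
sub build_hash_tree {
    my ($n) = @_;
    my $root = { _tag => 'html', _content => [] };
    for (1 .. $n) {
        push @{ $root->{_content} },
            { _mother => $root, _content => [], _tag => 'blockquote' };
    }
    return $root;
}

# Same shape as arrays: [ tag, mother, 0, ...content... ].
# The 0 is assumed here to mean "no attributes".
sub build_array_tree {
    my ($n) = @_;
    my $root = [ 'html', undef, 0 ];
    push @$root, [ 'blockquote', $root, 0 ] for 1 .. $n;
    return $root;
}

my $hash_tree  = build_hash_tree(1000);
my $array_tree = build_array_tree(1000);

# Devel::Size is not a core module, so load it only if it is installed.
if (eval { require Devel::Size; 1 }) {
    printf "hash-based:  %d bytes\n", Devel::Size::total_size($hash_tree);
    printf "array-based: %d bytes\n", Devel::Size::total_size($array_tree);
}
else {
    print "Devel::Size not installed; nothing measured\n";
}
```

Since total_size follows references, measuring the root covers the whole tree, mother links included.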
Incidentally, what I'm thinking of could take two forms:
1) Modifying HTML::Element to store objects as blessed arrayrefs,
instead of as hashes. Obviously, this will break anything that
currently breaks encapsulation on HTML::Element objects.
Or:
2) Not changing HTML::Element at all, but making a new class for nodes
that provides at least a subset of HTML::Element's interface. I'd
then provide a way to turn trees made of HTML::Element objects into
trees made of objects of this more memory-efficient class. (Speed of
access wouldn't be significantly slower, incidentally, assuming
elements wouldn't have excessive numbers of attributes each.) Making
this class would be a bit simpler if I provided only a subset of
HTML::Element's interface, esp. if it were read-only (i.e., no moving
or deleting nodes or adding or changing attribute values).
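To make option 2 a bit more concrete, here is a minimal read-only sketch. Everything in it is an assumption on my part: the class name Compact::Node is hypothetical, the array layout reads the third slot as an attribute count followed by that many name/value pairs, and the converter takes plain hashes in the layout shown earlier rather than real HTML::Element objects. The method names tag, parent, attr and content_list are borrowed from HTML::Element's interface, but only as a small subset.

```perl
use strict;
use warnings;

package Compact::Node;   # hypothetical name, not an existing module

# Assumed slot layout:
#   0 = tag, 1 = mother, 2 = attribute count N,
#   then N name/value pairs, then content (nodes and/or text segments).
sub tag    { $_[0][0] }
sub parent { $_[0][1] }

sub attr {
    my ($self, $name) = @_;
    my $end = 3 + 2 * $self->[2];
    # Linear scan; cheap as long as elements have few attributes.
    for (my $i = 3; $i < $end; $i += 2) {
        return $self->[ $i + 1 ] if $self->[$i] eq $name;
    }
    return undef;
}

sub content_list {
    my $self  = shift;
    my $start = 3 + 2 * $self->[2];
    return @{$self}[ $start .. $#$self ];
}

# Convert a hash-based node into a compact one, recursively.
# Keys without a leading underscore are taken to be attributes.
sub from_hash {
    my ($class, $old, $mother) = @_;
    my @names = sort grep { !/^_/ } keys %$old;
    my @pairs = map { ($_, $old->{$_}) } @names;
    my $new   = bless [ $old->{_tag}, $mother, scalar(@names), @pairs ], $class;
    for my $kid (@{ $old->{_content} || [] }) {
        push @$new, ref $kid ? $class->from_hash($kid, $new) : $kid;
    }
    return $new;
}

package main;

my $old = {
    _tag     => 'blockquote',
    id       => 'q1',
    _content => [ 'Some text, ', { _tag => 'em', _content => ['stressed'] } ],
};
my $node = Compact::Node->from_hash($old);
my ($text, $em) = $node->content_list;
print join(' ', $node->tag, $node->attr('id'), $em->tag), "\n";
# prints "blockquote q1 em"
```

Leaving out setters keeps the class small, which is the simplification the read-only variant is after.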
Another sub-possibility for case 2, by the way, is to have
HTML::TreeBuilder generate trees out of nodes not of class
HTML::Element, but instead of the more memory-efficient class. That
way, you'd not even temporarily use much memory. (Assuming that
having an array instead of a hash-and-an-array is a big win; otherwise
there's no point.)
An elaboration on case 2 is to have a wrapper class around nodes, a
class encapsulating whole trees. The win here is that the
tree-wrapper (which you could basically only traverse, or that plus
some operations based on traversing) wouldn't have links back to
itself, so you wouldn't have to explicitly call $tree->delete on it;
when it'd pass out of scope, it'd be garbage-collected automatically, for
whatever that's worth to anyone.
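The garbage-collection point can be seen in a small experiment. This is only a sketch of the underlying reference-counting behaviour, not of the wrapper class itself; the two class names (Linked::Node, Unlinked::Node) and the %destroyed counter are made up for the illustration.

```perl
use strict;
use warnings;

my %destroyed;   # counts DESTROY calls per class

package Linked::Node;     # hypothetical: each child keeps a link to its mother

sub new {
    my ($class, $mother) = @_;
    my $self = bless { _mother => $mother, _content => [] }, $class;
    push @{ $mother->{_content} }, $self if $mother;
    return $self;
}
sub DESTROY { $destroyed{ ref $_[0] }++ }

package Unlinked::Node;   # hypothetical: no link back to the mother

sub new {
    my ($class, $mother) = @_;
    my $self = bless { _content => [] }, $class;
    push @{ $mother->{_content} }, $self if $mother;
    return $self;
}
sub DESTROY { $destroyed{ ref $_[0] }++ }

package main;

for my $class (qw(Linked::Node Unlinked::Node)) {
    my $root = $class->new;
    $class->new($root) for 1 .. 3;
    # $root goes out of scope at the end of this block
}

# The linked tree is still alive here (0 of 4 freed), because mother and
# children hold references to each other; the unlinked one is gone (4 of 4).
printf "%-14s %d of 4 nodes freed\n", $_, $destroyed{$_} || 0
    for qw(Linked::Node Unlinked::Node);
```

The linked tree is only reclaimed once something breaks the cycle by hand, which is the job $tree->delete does for HTML::Element trees.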
Thoughts, anyone? Comments?
--
Sean M. Burke [EMAIL PROTECTED] http://www.netadventure.net/~sburke/