Recently I observed that the things that make HTML hard to parse also
make it a bit slow to parse -- there's time overhead to all that stuff
in HTML::TreeBuilder that has to check the context of new nodes before
it can insert them.

Now, one can use Data::Dumper to save the contents of a tree, but its
output is often a representation of the tree that is many, many times
larger than the raw HTML itself.

So I've just come up with two routines, one that dumps an Element tree
as a big blob of binary goo and one that reads it back.  The blob is
generally no bigger than (and often somewhat smaller than) the
corresponding HTML source.  It's very fast because it isn't a
general-purpose dumper: it works only for HTML::Element tree
structures.  (For example, it doesn't have to deal with deep-dumping
reference structures -- it assumes that _parent and _content are the
only attributes that will have references as values, and that Element
trees will be composed only of HTML::Element objects and text
segments.)
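
To make the idea concrete, here is a minimal sketch in Python of how a
dumper can stay simple under those assumptions.  It is not the actual
routines: the Element class, dump_tree, load_tree, and the record
layout (a one-byte type tag plus length-prefixed strings) are all
hypothetical stand-ins.  The parent links are never written out; they
are rebuilt while reading, since the nesting already implies them.

```python
import io
import struct

class Element:
    """Hypothetical stand-in for an HTML::Element node."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.content = []    # child Elements and plain text strings
        self.parent = None   # never stored in the blob; rebuilt on load

    def push(self, *children):
        for child in children:
            if isinstance(child, Element):
                child.parent = self
            self.content.append(child)
        return self

def _put(fh, text):
    # Write one string as a 4-byte big-endian length plus UTF-8 bytes.
    data = text.encode("utf-8")
    fh.write(struct.pack(">I", len(data)))
    fh.write(data)

def _get(fh):
    # Read back one string written by _put.
    (length,) = struct.unpack(">I", fh.read(4))
    return fh.read(length).decode("utf-8")

def dump_tree(node, fh):
    """Write a node (text segment or Element) and its children to fh."""
    if isinstance(node, str):
        fh.write(b"T")
        _put(fh, node)
        return
    fh.write(b"E")
    _put(fh, node.tag)
    fh.write(struct.pack(">I", len(node.attrs)))
    for key, value in node.attrs.items():
        _put(fh, key)
        _put(fh, value)
    fh.write(struct.pack(">I", len(node.content)))
    for child in node.content:
        dump_tree(child, fh)

def load_tree(fh, parent=None):
    """Read one node back from fh, restoring parent links as we go."""
    kind = fh.read(1)
    if kind == b"T":
        return _get(fh)
    if kind != b"E":
        raise ValueError("unexpected record type: %r" % kind)
    node = Element(_get(fh))
    node.parent = parent
    (n_attrs,) = struct.unpack(">I", fh.read(4))
    for _ in range(n_attrs):
        key = _get(fh)
        node.attrs[key] = _get(fh)
    (n_children,) = struct.unpack(">I", fh.read(4))
    for _ in range(n_children):
        node.content.append(load_tree(fh, node))
    return node

# Usage: round-trip a small tree through an in-memory filehandle.
tree = Element("p", {"class": "intro"}).push("Hello, ", Element("b").push("world"))
buffer = io.BytesIO()
dump_tree(tree, buffer)
buffer.seek(0)
restored = load_tree(buffer)
```

Because only two node kinds and string-valued attributes are allowed,
the reader never has to track arbitrary references or detect cycles,
which is where a general-purpose dumper spends much of its effort.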

In practice, it's not much different than writing $tree->as_HTML to a
file, and then reading it back in with a call to TreeBuilder's
parse_file -- but it's much faster for writing as well as reading.

Just out of curiosity, would anyone find this of much use?  I wrote it
partly just for kicks, but also because I was constantly having to
re-re-read the same local unchanging HTML file (against which I was
trying to test particular ways of scanning its content) and kept
thinking "good God, this is slow".  Granted, I'm still using a pre-XS
version of HTML::Parser, and it was a complicated and largish file.

Currently the routines, tentatively named "imblobulate" and
"deblobulate" (blob = binary large object), only write to / read from
a filehandle you pass them.  I'm curious whether anyone would need
them to return or accept a specified scalar value.  At least for
deblobulating, the internal logic for reading from a FH and decoding
from a passed string would be quite different.
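
As a rough illustration of why the two cases differ, here is a small
Python sketch that decodes the same length-prefixed records two ways.
The function names and the record layout are assumptions made up for
this example, not the format the real routines use.  The filehandle
version asks the stream for exactly the bytes it needs next; the
string version has to keep its own offset into the buffer.

```python
import io
import struct

def encode_records(strings):
    """Pack strings as 4-byte big-endian lengths followed by UTF-8 bytes."""
    out = bytearray()
    for text in strings:
        data = text.encode("utf-8")
        out += struct.pack(">I", len(data)) + data
    return bytes(out)

def read_from_filehandle(fh):
    """Decode records by pulling bytes from a stream as they are needed."""
    records = []
    while True:
        header = fh.read(4)
        if not header:
            break  # clean end of input
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        data = fh.read(length)
        if len(data) < length:
            raise ValueError("truncated record")
        records.append(data.decode("utf-8"))
    return records

def read_from_string(blob):
    """Decode records by walking an in-memory buffer with an explicit offset."""
    records = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        if pos + length > len(blob):
            raise ValueError("truncated record")
        records.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return records

# Usage: both readers should recover the same records from the same bytes.
blob = encode_records(["html", "body", "Hello"])
from_fh = read_from_filehandle(io.BytesIO(blob))
from_str = read_from_string(blob)
```

One way to avoid maintaining two decoders is to wrap the string in an
in-memory filehandle (io.BytesIO here) and reuse the stream reader,
though that may cost an extra copy of the data.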

-- 
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
