At 10:29 PM 2001-07-20 +0200, Bjoern Hoehrmann wrote:
>[...]
>HTML Tidy is currently maintained by a group of developers including
>myself at Sourceforge. One of our goals is to create a free-standing
>C-library out of Tidy to ease its reuse in other applications, see
>[...] I'm going to write a Perl XS interface to this library
>[...] My current
>module provides a simple (XML::Parser::Perl)SAX interface so that I can
>use the module to build up a DOM tree for e.g. XML::DOM or XML::XPath.
So the C-Tidy-library builds a document tree for some HTML file, and then
SAX walks the tree so that you can, via Perl, build a new (in-Perl) tree
for it, using the tree library of your choice (XML::DOM, XML::Element, or
even some crazy thing called HTML::Element)?
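
A minimal sketch of that pipeline as I picture it, in Python rather than Perl or C purely for illustration. SrcNode, walk, and TreeBuilder are made-up names, not Tidy's structures or any real module's API; the nested-list output just stands in for whatever tree library sits on the receiving end.

```python
# Hypothetical stand-in for a node in the parser's own tree.
# A node with tag=None is a text node. This is NOT Tidy's real layout.
class SrcNode:
    def __init__(self, tag=None, text=None, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)


def walk(node, handler):
    """Depth-first walk of the source tree, firing SAX-style callbacks."""
    if node.tag is None:
        handler.characters(node.text)
        return
    handler.start_element(node.tag)
    for child in node.children:
        walk(child, handler)
    handler.end_element(node.tag)


class TreeBuilder:
    """Rebuilds a tree from the events, as nested [tag, children] lists."""

    def __init__(self):
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def start_element(self, tag):
        node = [tag, []]
        if self.stack:
            self.stack[-1][1].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def characters(self, text):
        if self.stack:
            self.stack[-1][1].append(text)

    def end_element(self, tag):
        self.stack.pop()


# Source tree for <html><body><p>hi</p></body></html>
src = SrcNode("html", children=[
    SrcNode("body", children=[
        SrcNode("p", children=[SrcNode(text="hi")]),
    ]),
])

builder = TreeBuilder()
walk(src, builder)
```

The point of the indirection is that the walker never needs to know which tree class is being built; swapping the handler swaps the target tree.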
I dimly (mis?)remember looking at Tidy's internals months ago, and I think
I remember that it stored everything as double-byte Unicode strings -- so I
presume that those get UTF8ified (and tagged as such) when passed to Perl,
right? Does it deal nicely with non-UTF8 non-Latin-1 input encodings? One
thing I've not tested with HTML::Tree is how it deals with non-Latin-1
encodings; and if Tidy deals correctly with them, and provides an
alternative to HTML::Tree, then I would feel a bit less guilt-ridden at
the thought that HTML::Tree might be mangling text in Shift-JIS or whatever.
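
To make the worry concrete, here is a small sketch (again in Python, as an illustration only; it says nothing about what Tidy or HTML::Tree actually do internally). It assumes the input encoding is known to be Shift-JIS, and compares decoding it properly against treating the same bytes as Latin-1.

```python
# Sample text and its Shift-JIS byte form, as it might arrive from a file.
text = "日本語"
raw = text.encode("shift_jis")

# Correct handling: decode with the real input encoding, then hand the
# result on as UTF-8 (or as two-byte UTF-16 code units internally).
decoded = raw.decode("shift_jis")
as_utf8 = decoded.encode("utf-8")
as_utf16 = decoded.encode("utf-16-le")

# Mangling: the same bytes misread as Latin-1. This never raises an
# error, because every byte is a valid Latin-1 character; the text is
# simply wrong afterwards.
mangled = raw.decode("latin-1")
```

Here `decoded` round-trips to the original string while `mangled` does not, and nothing signals the failure, which is why this kind of bug is easy to miss when testing only with Latin-1 documents.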
--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/