* Sean M. Burke wrote:
>So the C-Tidy-library builds a document tree for some HTML file, and then
>SAX walks the tree so that you can, via Perl, build a new (in-Perl) tree
>for it, using the tree library of your choice (XML::DOM, XML::Element, or
>even some crazy thing called HTML::Element) ?
Yes. Something similar goes for HTML::Parser-style events. This is one
reason why HTML::Tidy could not support all HTML::Parser events: e.g.
information like 'offset', 'line', 'column' or 'tokens' gets lost in the
parsing/cleanup process. OK, it would be possible by making a lot of
changes in Tidy, but I don't think it's worth the effort; Tidy's power
_is_ the clean-tree generation.
>I dimly (mis?)remember looking at Tidy's internals months ago, and I think
>I remember that it stored everything as double-byte Unicode strings -- so I
>presume that those get UTF8ified (and tagged as such) when passed to Perl,
>right?
Tidy stores all character data as UTF-8 encoded char*s. They will be
passed as UTF-8 to Perl (tagged as such via SvUTF8_on()) or, for the
pretty-printer, in your desired encoding (if supported).
>Does it deal nicely with non-UTF8 non-Latin-1 input encodings?
Currently Tidy supports the following input encodings:
* us-ascii *
* iso-8859-1 *
* windows-1252
* mac-roman
* the iso-2022 family *
* utf-8 *
[*] denotes a supported output encoding
I'm not sure about UTF-16; if Tidy doesn't support it, I'll add support
for it. We have feature requests for
* ShiftJIS
* BIG5
You do, however, have to declare which encoding you are using. One might
use Unicode::Map8 or Text::Iconv to convert strings to UTF-8 before
passing them to Tidy.
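
To illustrate the "convert to UTF-8 first" idea, here is a minimal sketch.
It is written in Python purely for illustration and does not use
Unicode::Map8, Text::Iconv or Tidy itself. The helper name to_utf8 is
hypothetical, and the sketch assumes the caller already knows the input
encoding, as described above.

```python
# Sketch only: normalise input bytes to UTF-8 before handing them to a
# tidy-like tool. `to_utf8` is a hypothetical helper, not part of Tidy.

def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode `raw` with the encoding the caller declares, re-encode as UTF-8."""
    # Raises UnicodeDecodeError for bytes that are invalid in that encoding,
    # and LookupError if the encoding name is unknown.
    text = raw.decode(declared_encoding)
    return text.encode("utf-8")

# "Björn" is 5 bytes in ISO-8859-1; in UTF-8 the "ö" takes two bytes.
latin1_bytes = "Björn".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# Shift_JIS input (one of the requested encodings) is handled the same way.
sjis_bytes = "日本語".encode("shift_jis")
sjis_as_utf8 = to_utf8(sjis_bytes, "shift_jis")
```

The point is only that the encoding has to be declared by the caller;
nothing here tries to guess it from the bytes.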
--
Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/