* Sean M. Burke wrote:
>So the C-Tidy-library builds a document tree for some HTML file, and then
>SAX walks the tree so that you can, via Perl, build a new (in-Perl) tree
>for it, using the tree library of your choice (XML::DOM, XML::Element, or
>even some crazy thing called HTML::Element) ?

Yes. Something similar goes for HTML::Parser-style events. This is one
reason why HTML::Tidy cannot support all HTML::Parser events: information
such as 'offset', 'line', 'column' or 'tokens' gets lost in the
parsing/cleanup process. OK, it would be possible with a lot of changes
to Tidy, but I don't think it's worth the effort; Tidy's power _is_ the
clean-tree generation.

>I dimly (mis?)remember looking at Tidy's internals months ago, and I think
>I remember that it stored everything as double-byte Unicode strings -- so I
>presume that those get UTF8ified (and tagged as such) when passed to Perl,
>right?

Tidy stores all character data as UTF-8 encoded char*s. They will be
passed as UTF-8 to Perl (tagged as such via SvUTF8_on()) or, for the
pretty-printer, in your desired encoding (if supported).

>Does it deal nicely with non-UTF8 non-Latin-1 input encodings?

Currently Tidy supports

  * us-ascii *
  * iso-8859-1 *
  * windows-1252
  * mac-roman
  * the iso-2022 family *
  * utf-8 *

[*] denotes a supported output encoding

I'm not sure about UTF-16; if Tidy doesn't support it, I'll add support
for it. We have feature requests for

  * ShiftJIS
  * BIG5

You do, however, have to declare which encoding you are using. One might
use Unicode::Map8 or Text::Iconv to convert strings to UTF-8 before
passing them to Tidy.
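
As a minimal sketch of that conversion step, assuming the input is
ISO-8859-1: the byte-level mapping looks roughly like this in C. The
function name latin1_to_utf8 is made up for illustration; it is not part
of Tidy or of the modules mentioned above.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of Tidy: convert an ISO-8859-1 buffer
 * of `len` bytes into a newly allocated, NUL-terminated UTF-8 string.
 * Returns NULL if allocation fails. The caller must free() the result.
 */
char *latin1_to_utf8(const unsigned char *in, size_t len)
{
    /* Worst case: every input byte expands to two output bytes. */
    char *out = malloc(2 * len + 1);
    size_t i, j = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range is identical in both encodings. */
            out[j++] = (char)c;
        } else {
            /* U+0080..U+00FF becomes a two-byte UTF-8 sequence. */
            out[j++] = (char)(0xC0 | (c >> 6));
            out[j++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return out;
}
```

This only works for ISO-8859-1, where each byte value equals its Unicode
code point. For other encodings (windows-1252, mac-roman, ShiftJIS, ...)
you need real mapping tables, which is what Unicode::Map8 or Text::Iconv
give you on the Perl side.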
-- 
Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
