* Christopher J. Madsen wrote:
>My repo is https://github.com/madsen/io-html but since it's built with
>dzil, I also made a gist of the processed module to make it easier to
>read the docs: https://gist.github.com/1623654
>
>I took a quick look at HTTP::Message, and I think you'd just need to do
>
>    elsif ($self->content_is_html) {
>       require IO::HTML;
>       my $charset = IO::HTML::find_charset_in($$cref);
>       return $charset if $charset;
>    }
>
>You're already doing the BOM and valid-UTF8 checks; all you need is the
><meta> check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the
"HTML5" proposal. For instance, HTTP::Message seems to recognize UTF-32
BOMs, but as I recall the "HTML5" proposal does not allow that.
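
To make the BOM point concrete, here is a minimal sketch of a BOM check
that recognizes only the UTF-8 and UTF-16 signatures, which is how I
recall the "HTML5" draft handling BOMs; please check the current text of
the proposal before relying on it. The function name bom_encoding is
hypothetical and is not part of HTTP::Message or IO::HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: map a leading byte order mark to an encoding name.
# Assumption: only UTF-8 and UTF-16 BOMs count; UTF-32 BOMs are not
# given any special treatment.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;    # no recognized BOM
}

for my $sample ("\xEF\xBB\xBF<html>", "\xFF\xFE\x00\x00", "\x00\x00\xFE\xFF") {
    my $enc = bom_encoding($sample);
    printf "%-12s %s\n", unpack('H8', $sample), defined $enc ? $enc : '(none)';
}
```

Under this assumption the UTF-32LE signature FF FE 00 00 is reported as
UTF-16LE and the UTF-32BE signature 00 00 FE FF is not recognized at all,
which differs from a detector that knows about UTF-32.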

Your UTF-8 validation code seems wrong to me: you consider the sequence
F0 80 to be incomplete, but it is actually invalid. The same goes for ED 80;
see the chart in <http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design>.
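
The difference between "incomplete" and "invalid" can be sketched as
follows. This is not the code from IO::HTML or from my decoder; it is a
regex-based illustration, and the byte ranges are my transcription of
the well-formed byte sequence table in the Unicode Standard, so treat
them as an assumption to verify. The name classify_utf8 is hypothetical.

```perl
use strict;
use warnings;

# One complete, well-formed UTF-8 sequence (byte ranges assumed from
# the Unicode Standard's table of well-formed UTF-8 byte sequences).
my $well_formed = qr/
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# A proper prefix of one well-formed sequence: more bytes could still
# turn it into a complete sequence.
my $prefix = qr/
    [\xC2-\xDF]
  | \xE0(?:[\xA0-\xBF])?
  | [\xE1-\xEC\xEE\xEF](?:[\x80-\xBF])?
  | \xED(?:[\x80-\x9F])?
  | \xF0(?:[\x90-\xBF][\x80-\xBF]?)?
  | [\xF1-\xF3](?:[\x80-\xBF]{1,2})?
  | \xF4(?:[\x80-\x8F][\x80-\xBF]?)?
/x;

# Hypothetical helper: returns 'valid', 'incomplete', or 'invalid'
# for a byte string that may have been cut off at the end.
sub classify_utf8 {
    my ($bytes) = @_;
    return 'valid'      if $bytes =~ /\A(?:$well_formed)*\z/;
    return 'incomplete' if $bytes =~ /\A(?:$well_formed)*(?:$prefix)\z/;
    return 'invalid';
}

printf "%-12s %s\n", unpack('H*', $_), classify_utf8($_)
    for "\xF0\x80", "\xF0\x90", "\xF0\x90\x80\x80", "abc\xE2\x82";
```

With these ranges F0 80 is rejected outright, because no well-formed
sequence starts with those two bytes, while F0 90 is merely truncated
and could be completed by later input.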

Anyway, if people think this is the way to go, maybe HTTP::Message can
adopt the Content-Type header charset extraction tests in HTML::Encoding
so they don't get lost as my module becomes redundant?
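
To illustrate the kind of cases such tests need to cover, here is a
small sketch. It is not the HTML::Encoding test suite and not the
HTTP::Message implementation; charset_from_content_type is a
hypothetical, simplified extractor, and the sample header values are
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, simplified extractor for the charset parameter of a
# Content-Type header value. Handles tokens and quoted strings only.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    while ($value =~ /;\s*([^\s=;"]+)\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^\s;"]*))/g) {
        my ($name, $quoted, $token) = ($1, $2, $3);
        next unless lc($name) eq 'charset';
        if (defined $quoted) {
            (my $unescaped = $quoted) =~ s/\\(.)/$1/g;   # undo quoted-pair escapes
            return lc $unescaped;
        }
        return length($token) ? lc($token) : undef;
    }
    return;    # no charset parameter found
}

# Made-up sample values covering case, quoting, and a decoy parameter.
my @cases = (
    'text/html; charset=UTF-8',
    'text/html;charset="iso-8859-1"',
    'text/html; CHARSET=Shift_JIS; foo=bar',
    'text/html; foo="a;charset=x"; charset=utf-8',
    'text/html',
);
for my $header (@cases) {
    my $charset = charset_from_content_type($header);
    printf "%-45s => %s\n", $header, defined $charset ? $charset : '(none)';
}
```

The decoy case is the interesting one: a charset that appears inside
another parameter's quoted string must not be picked up.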
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
