* Christopher J. Madsen wrote:
>My repo is https://github.com/madsen/io-html but since it's built with
>dzil, I also made a gist of the processed module to make it easier to
>read the docs: https://gist.github.com/1623654
>
>I took a quick look at HTTP::Message, and I think you'd just need to do
>
>  elsif ($self->content_is_html) {
>    require IO::HTML;
>    my $charset = IO::HTML::find_charset_in($$cref);
>    return $charset if $charset;
>  }
>
>You're already doing the BOM and valid-UTF8 checks; all you need is the
><meta> check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to
the "HTML5" proposal. For instance, HTTP::Message seems to recognize
UTF-32 BOMs, but as I recall the "HTML5" proposal does not allow that.

Your UTF-8 validation code also seems wrong to me: you consider the
sequence F0 80 to be incomplete, but it is actually invalid, because no
further bytes can turn it into a well-formed character. The same goes
for ED A0. See the chart in
<http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design>.

Anyway, if people think this is the way to go, maybe HTTP::Message can
adopt the Content-Type header charset extraction tests in
HTML::Encoding, so they don't get lost as my module becomes redundant?
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
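
To make the incomplete-versus-invalid distinction concrete, here is a minimal sketch in plain Perl. It is not the code from IO::HTML or HTTP::Message; the function name classify_utf8 is hypothetical. The second-byte ranges are the well-formed ranges from RFC 3629: F0 must be followed by 90..BF and ED by 80..9F. A trailing fragment counts as "incomplete" only if some continuation could still complete it.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte string as 'valid', 'incomplete'
# (well-formed so far, but truncated inside the last character), or
# 'invalid' (no continuation could ever make it well-formed).
sub classify_utf8 {
    my ($bytes) = @_;

    # One complete well-formed character (ranges per RFC 3629).
    my $char = qr/
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    /x;

    # A proper prefix of a well-formed multi-byte character.
    my $prefix = qr/
          [\xC2-\xDF]
        | \xE0(?:[\xA0-\xBF])?
        | [\xE1-\xEC\xEE\xEF](?:[\x80-\xBF])?
        | \xED(?:[\x80-\x9F])?
        | \xF0(?:[\x90-\xBF][\x80-\xBF]?)?
        | [\xF1-\xF3](?:[\x80-\xBF]{1,2})?
        | \xF4(?:[\x80-\x8F][\x80-\xBF]?)?
    /x;

    return 'valid'      if $bytes =~ /\A $char* \z/x;
    return 'incomplete' if $bytes =~ /\A $char* $prefix \z/x;
    return 'invalid';
}

print classify_utf8("\xF0\x80"), "\n";   # invalid: F0 needs 90..BF next
print classify_utf8("\xF0\x90"), "\n";   # incomplete: two more bytes could follow
print classify_utf8("\xED\xA0"), "\n";   # invalid: would encode a surrogate
print classify_utf8("\xED\x80"), "\n";   # incomplete: ED 80 80 is U+D000
```

The point of keeping a separate prefix pattern is that a sniffer reading only the first N bytes of a document can tolerate a cut-off final character without also accepting sequences that are already malformed.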