* Gisle Aas wrote:
>The current $mess->decoded_content implementation is quite naïve in
>it's mapping of charsets. It need to either start using Björn's
>HTML::Encoding module or start doing similar sniffing to better guess
>the charset when the Content-Header does not provide any.

<http://search.cpan.org/dist/HTML-Encoding/>. I very much welcome ideas
and patches that would help here. The module is currently just good
enough to replace the custom detection code in the W3C Markup Validator
check script (which has been the basic motivation for the module from
the start) and is in that sense pretty much ad hoc... I do indeed think
that the libwww-perl modules would be a better place for much of the
functionality.

>I also plan to expose a $mess->charset method that would just return
>the guessed charset, i.e. something similar to
>encoding_from_http_message() provided by HTML::Encoding.

A $mess->header_charset might be a good start here, which just gives
the charset parameter in the Content-Type header. This would be what
HTML::Encoding::encoding_from_content_type($mess->header('Content-Type'))
does. HTTP::Message would be a better place for that code, as the
charset parameter applies to far more than just HTML/XML (all text/*
types have one, for example). The same probably goes for other things
as well, such as the BOM detection code in HTML::Encoding.

>Another problem is that I have no idea how well the charset names
>found in the HTTP/HTML maps to the encoding names that the perl Encode
>module supports. Anybody knows what the state here is?

Things might work out in common cases, but it's not quite where I think
it should be. I've recently started a thread on perl-unicode about it,
<http://www.nntp.perl.org/group/perl.unicode/2648>; I found that
I18N::Charset is needed in addition to Encode, and that I18N::Charset
(still) lacks quite a number of mappings (see the comments in the
source of the module).

>When this works the next step is to figure out the best way to do
>streamed decoding. This is needed for the HeadParser that LWP
>invokes.

One problem here is stateful encodings such as UTF-7 or the ISO-2022
family of encodings, as Encode::PerlIO notes (and attempts to work
around for many encodings).
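To make "stateful" concrete, here is a minimal sketch that decodes the
same UTF-7 bytes once in a single decode() call and once chunk by chunk.
decode_chunks() is a hypothetical helper written for this illustration,
not anything from Encode or LWP. The sketch assumes that Encode's UTF-7
codec accepts a shift sequence that is still open at the end of the
input instead of rejecting it, which I believe is the case but have not
checked across Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every chunk with its own decode() call,
# the way a naive streaming loop would. No decoder state is carried
# over from one chunk to the next.
sub decode_chunks {
    my ($encoding, @chunks) = @_;
    return join '', map { decode($encoding, $_) } @chunks;
}

# One call sees the complete "+APY-" shift sequence.
my $whole = decode('UTF-7', 'Bj+APY-rn');

# Here the chunk boundary falls right before the "-", so the second
# call has no way of knowing that a shift sequence was still open.
my $split = decode_chunks('UTF-7', 'Bj+APY', '-rn');

binmode STDOUT, ':encoding(UTF-8)';
print "whole: $whole\n";
print "split: $split\n";
print "equal: ", ($whole eq $split ? "yes" : "no"), "\n";
```

If the assumption holds, the two results differ only in whether the "-"
is consumed as the end of the shift sequence or kept as a literal
hyphen, so the output depends on where the chunk boundary happens to
fall.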
For example, the code you posted to perl-unicode (re incomplete
sequences) would fail for UTF-7 "Bj+APY-rn" if it happens to split the
string after "Bj+APY". That would be a complete sequence, but the
meaning of the following "-rn" depends on the current state of the
decoder, which decode() does not maintain. So it might sometimes decode
to "Bjö-rn" and sometimes to "Björn", which is not desirable (it might
have security implications, for example).

I am not sure whether there is an easy way to use the PerlIO
workarounds without using PerlIO. I've tried using PerlIO::scalar in
HTML::Encoding, but (see
<http://www.nntp.perl.org/group/perl.unicode/2675>) it modifies the
scalar on some encoding errors and I did not investigate this further.
Maybe Encode should provide a simpler means for decoding possibly
incomplete sequences...

Also, HTML::Parser might be the best place to deal with at least the
case where the (or an) encoding is already known, so that it would
decode the bytes passed to it itself. I would then probably replace my
poor custom HTML::Encoding::encoding_from_meta_element with
HTML::HeadParser, looping through possible encodings (probably giving
up once one worked out; it would currently decode with UTF-8 and
ISO-8859-1 for most cases, which is quite unlikely to return different
results...)
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/