* Gisle Aas wrote:
>The current $mess->decoded_content implementation is quite naïve in
>its mapping of charsets.  It needs to either start using Björn's
>HTML::Encoding module or start doing similar sniffing to better guess
>the charset when the Content-Type header does not provide any.

<http://search.cpan.org/dist/HTML-Encoding/>. I very much welcome ideas
and patches that would help here. The module is currently just good
enough to replace the custom detection code in the W3C Markup Validator's
check script (which has been the basic motivation for the module all
along), and is in that sense pretty much ad hoc... I do indeed think that
the libwww-perl modules would be a better place for much of the
functionality.

>I also plan to expose a $mess->charset method that would just return
>the guessed charset, i.e. something similar to
>encoding_from_http_message() provided by HTML::Encoding.

A $mess->header_charset might be a good start here, one that just returns
the charset parameter of the Content-Type header. This would be what

  HTML::Encoding::encoding_from_content_type($mess->header('Content-Type'))

does. HTTP::Message would be a better place for that code, as the charset
parameter is by no means specific to HTML/XML (all text/* types have one,
for example). The same probably goes for other things as well, such as
the BOM detection code in HTML::Encoding.
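
For illustration, a minimal header_charset() along those lines might look
like the sketch below (the name and the regexp are just placeholders; a
real implementation should use a proper MIME parameter parser):

  # Sketch only: return the charset parameter of an HTTP::Message's
  # Content-Type header, or undef if there is none.
  sub header_charset {
      my $mess = shift;
      my $ct = $mess->header('Content-Type');
      return undef unless defined $ct;
      return lc($1) if $ct =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i;
      return undef;
  }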

>Another problem is that I have no idea how well the charset names
>found in HTTP/HTML map to the encoding names that the perl Encode
>module supports.  Does anybody know what the state here is?

Things might work out in common cases, but it's not quite where I think
it should be. I recently started a thread on perl-unicode about this,
<http://www.nntp.perl.org/group/perl.unicode/2648>; I found that
I18N::Charset is needed in addition to Encode, and that I18N::Charset
(still) lacks quite a number of mappings (see the comments in the source
of the module).
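
To illustrate what I mean, a minimal lookup that combines the two might
look like this sketch (it assumes I18N::Charset's enco_charset_name(),
which maps registered charset names to names the Encode module knows):

  use Encode ();
  use I18N::Charset qw(enco_charset_name);

  # Map a charset name as found in HTTP/HTML to a name that Encode
  # accepts; returns undef if neither module knows the name.
  sub encode_name_for {
      my $charset = shift;
      my $name = Encode::resolve_alias($charset);
      return $name if $name;
      return enco_charset_name($charset);
  }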

>When this works the next step is to figure out the best way to do
>streamed decoding.  This is needed for the HeadParser that LWP
>invokes.

One problem here is stateful encodings such as UTF-7 or the ISO-2022
family, as Encode::PerlIO notes (and attempts to work around for many
encodings). For example, the code you posted to perl-unicode (re
incomplete sequences) would fail for the UTF-7 string "Bj+APY-rn" if it
happens to split the string after "Bj+APY": that is a complete sequence,
but the meaning of the following "-rn" depends on the current state of
the decoder, which decode() does not maintain. So it might sometimes
decode to "Bjö-rn" and sometimes to "Björn", which is not desirable (and
might have security implications, for example).
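
To make the failure mode concrete, here is a minimal illustration of the
split described above (the exact behaviour may depend on the version of
Encode):

  use Encode qw(decode);

  # Decoding the bytes in one go:
  my $whole = decode('UTF-7', 'Bj+APY-rn');    # "Björn"

  # Decoding the same bytes split after the complete "+APY" run:
  my $split = decode('UTF-7', 'Bj+APY')        # "Bjö"
            . decode('UTF-7', '-rn');          # "-rn"
  # $split is "Bjö-rn": the "-" that merely terminated the base64 run
  # in the first chunk becomes a literal "-" because the second
  # decode() call starts with fresh state.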

I am not sure whether there is an easy way to use the PerlIO workarounds
without using PerlIO. I've tried using PerlIO::scalar in HTML::Encoding,
but, as noted in <http://www.nntp.perl.org/group/perl.unicode/2675>, it
modifies the scalar on some encoding errors, and I did not investigate
this further. Maybe Encode should provide a simpler means for decoding
possibly incomplete sequences...
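
For encodings where incompleteness is a property of the byte sequence
alone (UTF-8, for instance), Encode's FB_QUIET check mode can already be
used to hold back a trailing partial sequence between chunks; it is the
stateful encodings above where that is not enough. A minimal sketch:

  use Encode qw(decode FB_QUIET);

  my $buffer = '';
  sub decode_chunk {
      my $bytes = shift;
      $buffer .= $bytes;
      # FB_QUIET decodes as much as possible and leaves any trailing
      # incomplete sequence in $buffer for the next call.
      return decode('UTF-8', $buffer, FB_QUIET);
  }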

Also, HTML::Parser might be the best place to deal at least with the
case where the (or an) encoding is already known, so that it would itself
decode the bytes passed to it. I would then probably replace my poor
custom HTML::Encoding::encoding_from_meta_element with HTML::HeadParser
looping through possible encodings, as sketched below (probably giving
up as soon as one of them works out; it currently decodes with UTF-8 and
ISO-8859-1 in most cases, which is quite unlikely to return different
results...)
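
The loop I have in mind would be roughly along these lines (the name
charset_from_meta and the regexp are just for illustration):

  use Encode qw(decode FB_CROAK);
  use HTML::HeadParser ();

  # Try candidate encodings in turn and return the charset declared in
  # a <meta http-equiv="Content-Type"> element, if any is found.
  sub charset_from_meta {
      my ($bytes, @candidates) = @_;
      for my $enc (@candidates) {
          # decode a copy, since some CHECK modes modify the source
          my $copy  = $bytes;
          my $chars = eval { decode($enc, $copy, FB_CROAK) };
          next unless defined $chars;
          my $p = HTML::HeadParser->new;
          $p->parse($chars);
          $p->eof;
          my $ct = $p->header('Content-Type') or next;
          return $1 if $ct =~ /\bcharset\s*=\s*"?([^";\s]+)"?/i;
      }
      return undef;
  }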
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
