Gisle Aas <[EMAIL PROTECTED]> writes:

>     The HTTP::Message object now has a decoded_content() method.
>     This will return the content after any Content-Encodings and
>     charsets have been decoded.

The current $mess->decoded_content implementation is quite naïve in
its mapping of charsets.  It needs to either start using Björn's
HTML::Encoding module or do similar sniffing to better guess the
charset when the Content-Type header does not provide one.

I also plan to expose a $mess->charset method that would just return
the guessed charset, i.e. something similar to
encoding_from_http_message() provided by HTML::Encoding.
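In the simplest case, such a charset() method could just pull the
charset parameter out of the Content-Type header.  A rough sketch
(the helper name is hypothetical, not part of HTTP::Message):

```perl
# Hypothetical helper: extract the charset parameter from a
# Content-Type header value, or return undef if none is declared.
sub charset_from_content_type {
    my ($content_type) = @_;
    return lc $1 if defined $content_type
        && $content_type =~ /\bcharset\s*=\s*"?([^";\s]+)"?/i;
    return undef;
}

print charset_from_content_type('text/html; charset=ISO-8859-1'), "\n";
```

The real method would of course fall back to sniffing (HTML::Encoding
or similar) when this returns undef.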

Another problem is that I have no idea how well the charset names
found in HTTP/HTML map to the encoding names that the perl Encode
module supports.  Does anybody know what the state is here?
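One way to probe this from Perl itself is Encode::resolve_alias(),
which returns the canonical Encode name for a charset label, or a
false value when Encode does not recognize it:

```perl
use Encode ();

# resolve_alias() maps a charset label to the canonical Encode name,
# returning a false value when the label is unknown.
for my $name ('iso-8859-1', 'latin1', 'windows-1252', 'x-no-such-charset') {
    my $canonical = Encode::resolve_alias($name);
    printf "%-18s => %s\n", $name, $canonical ? $canonical : '(unknown)';
}
```

That only answers whether Encode knows a given label, though, not
whether the label means the same thing in HTTP/HTML as it does to
Encode.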

When this works, the next step is to figure out the best way to do
streamed decoding.  This is needed for the HeadParser that LWP
invokes.
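One possible building block here (just a sketch, not what LWP does
today) is Encode's FB_QUIET mode: decode() then consumes only complete
character sequences and leaves a trailing partial sequence in the
buffer, ready to be completed by the next chunk:

```perl
use Encode qw(decode FB_QUIET);

# Streamed decoding sketch: with FB_QUIET, decode() consumes complete
# character sequences and leaves any trailing partial sequence behind
# in $buf, where the next chunk can complete it.
my $buf = '';
my $out = '';
for my $chunk ("gr\xC3", "\xB8d") {   # "grød" in UTF-8, split mid-character
    $buf .= $chunk;
    $out .= decode('UTF-8', $buf, FB_QUIET);
}
# $out is now the four-character string "gr\x{f8}d" and $buf is empty.
```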

The main motivation for decoded_content is that HTML::Parser now works
better if properly decoded Unicode can be provided to it, but it still
fails here:

  $ lwp-request -d www.microsoft.com
  Parsing of undecoded UTF-8 will give garbage when decoding entities
  at lib/LWP/Protocol.pm line 114.

Here decoded_content needs to sniff the BOM to be able to guess that
they use UTF-8 so that a properly decoded string can be provided to
HTML::HeadParser.
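A BOM sniff along those lines could look something like this
(hypothetical helper; note that the UTF-32 marks have to be checked
before the UTF-16 marks they share a prefix with):

```perl
# Hypothetical BOM sniffer: inspect the first bytes of the raw content
# and guess a Unicode encoding; undef means no BOM was found.
sub charset_from_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return undef;
}
```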

The decoded_content method also solves the frequent request for
supporting compressed content.  Just do something like this:

   $ua = LWP::UserAgent->new;
   $ua->default_header("Accept-Encoding" => "gzip, deflate");

   $res = $ua->get("http://www.example.com");
   print $res->decoded_content(charset => "none");

Regards,
Gisle
