Gisle Aas <[EMAIL PROTECTED]> writes:

> The HTTP::Message object now has a decoded_content() method.
> This will return the content after any Content-Encodings and
> charsets have been decoded.
The current $mess->decoded_content implementation is quite naïve in its mapping of charsets. It needs to either start using Björn's HTML::Encoding module or do similar sniffing itself to better guess the charset when the Content-Type header does not provide one. I also plan to expose a $mess->charset method that would just return the guessed charset, i.e. something similar to the encoding_from_http_message() function provided by HTML::Encoding.

Another problem is that I have no idea how well the charset names found in HTTP/HTML map to the encoding names that the perl Encode module supports. Does anybody know what the state here is?

Once this works, the next step is to figure out the best way to do streamed decoding. This is needed for the HeadParser that LWP invokes.

The main motivation for decoded_content is that HTML::Parser now works better if properly decoded Unicode can be provided to it, but it still fails here:

  $ lwp-request -d www.microsoft.com
  Parsing of undecoded UTF-8 will give garbage when decoding entities
  at lib/LWP/Protocol.pm line 114.

Here decoded_content needs to sniff the BOM to be able to guess that they use UTF-8, so that a properly decoded string can be provided to HTML::HeadParser.

decoded_content also solves the frequent request for supporting compressed content. Just do something like this:

  $ua = LWP::UserAgent->new;
  $ua->default_header("Accept-Encoding" => "gzip, deflate");
  $res = $ua->get("http://www.example.com");
  print $res->decoded_content(charset => "none");

Regards, Gisle
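For anyone who wants to probe the HTTP/HTML-vs-Encode naming question themselves, here is a minimal sketch using Encode::resolve_alias(), which maps a charset name to Encode's canonical encoding name and returns false for names Encode does not know (the list of names below is just an illustration, not an exhaustive survey):

```perl
use strict;
use warnings;
use Encode ();

# Charset names as they commonly appear in Content-Type headers or
# <meta http-equiv="Content-Type"> tags; the last one is a deliberate dud.
for my $name ("ISO-8859-1", "latin1", "Shift_JIS", "UTF-8", "no-such-charset") {
    my $canonical = Encode::resolve_alias($name);
    printf "%-16s => %s\n", $name, $canonical || "(not supported by Encode)";
}
```

Running something like this over the charset names actually seen in the wild would give a concrete picture of how big the gap is.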
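And a rough sketch of the BOM sniffing mentioned above; sniff_bom is a hypothetical helper, not part of HTTP::Message. Note that the longer UTF-32 marks have to be tested before the UTF-16 marks they share a prefix with:

```perl
use strict;
use warnings;

# Hypothetical helper: guess a charset from the byte-order mark at the
# start of the raw (undecoded) content; returns undef when there is no BOM.
sub sniff_bom {
    my ($bytes) = @_;
    return "UTF-8"    if $bytes =~ /^\xEF\xBB\xBF/;
    return "UTF-32LE" if $bytes =~ /^\xFF\xFE\x00\x00/;  # check before UTF-16LE
    return "UTF-32BE" if $bytes =~ /^\x00\x00\xFE\xFF/;  # check before UTF-16BE
    return "UTF-16LE" if $bytes =~ /^\xFF\xFE/;
    return "UTF-16BE" if $bytes =~ /^\xFE\xFF/;
    return undef;  # no BOM; fall back to other heuristics
}

print sniff_bom("\xEF\xBB\xBF<html>") || "no BOM", "\n";
```

A real implementation would of course strip the BOM before handing the decoded string to HTML::HeadParser.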