On Sat, Sep 22, 2007 at 11:53:14PM -0700, David Nesting wrote:
> On the other hand, I'm less convinced now that dipping into the HTML or XML
> content to figure out the proper encoding is necessarily the proper thing to
> do here.

Well, it's often needed since content providers may not have the
ability to alter the server's Content-Type header to add the correct
charset.
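As a rough illustration of the kind of in-content sniffing involved, a minimal charset sniff for HTML might look like the sketch below. This is a sketch only (the function name is mine, not part of LWP); a real parser such as libxml2 or HTML::HeadParser handles comments, quoting, and the spec's ordering rules, which a couple of regexes never will.

```perl
use strict;
use warnings;

# Rough charset sniff for HTML served without a charset parameter in
# the Content-Type header.  Hypothetical helper, for illustration only:
# a real parser (libxml2, HTML::HeadParser) covers the many edge cases.
sub sniff_html_charset {
    my ($octets) = @_;

    # HTML5-style <meta charset="...">
    return lc $1
        if $octets =~ /<meta\s+charset\s*=\s*["']?([\w.:-]+)/i;

    # Older <meta http-equiv="Content-Type" content="...; charset=...">
    return lc $1
        if $octets =~ /<meta[^>]+content\s*=\s*["'][^"']*charset=([\w.:-]+)/i;

    return undef;    # no in-content declaration found
}
```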

On the other hand, it probably depends on what you plan to do with the
content.  Passing off to a parser (e.g. libxml2) would also figure out
the encoding.

I have a program that uses LWP and decoded_content, but it then
re-encodes the content before passing it on to the next tool in the
chain, which will also decode it.  I've also considered parsing the
content, removing any content-specified charsets, and always
returning utf8.
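That decode/re-encode round trip can be sketched with the core Encode module. Here the decode step is simulated with Encode so the example is self-contained; in the real program that step would be $response->decoded_content:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sketch of the decode/re-encode round trip described above.
my $octets  = encode('ISO-8859-1', "caf\x{e9}");  # bytes as a server might send them
my $text    = decode('ISO-8859-1', $octets);      # what decoded_content would yield
my $utf8out = encode('UTF-8', $text);             # re-encoded for the next tool,
                                                  # which will decode it again
```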

> My complaint about LWP::Simple was that the HTTP Content-Type
> (charset) information is lost by the time it gets to the caller.  If
> the data isn't in text at that point, it will never reliably get
> there.  But for HTML and XML, if the character encoding is actually
> specified in the content rather than in the HTTP headers, then it
> isn't as important to deal with it up front.  I could see a case
> then for dealing with text/* only and returning octets for
> everything else, since text/* is the only media type that has
> character encoding details in the HTTP headers.  That being said,
> applications based on LWP::Simple are likely to work better with
> HTML and XML "assistance" for the reason I gave earlier: users of
> LWP::Simple probably aren't going to take the time to do the proper
> parsing and decoding.  Yes, it's still "their fault" for not coding
> a robust application, but helping them do that is I think still a
> valid goal, if we can do it safely.

I'd tend to agree.  Make LWP::Simple return decoded content, and if
you need more control, don't use LWP::Simple.

-- 
Bill Moseley
[EMAIL PROTECTED]
