On Sat, Sep 22, 2007 at 11:53:14PM -0700, David Nesting wrote:
> On the other hand, I'm less convinced now that dipping into the HTML or XML
> content to figure out the proper encoding is necessarily the proper thing to
> do here.
Well, it's often needed, since content providers may not have the ability to
alter the server's Content-Type header to add the correct charset. On the
other hand, it probably depends on what you plan to do with the content.
Passing it off to a parser (e.g. libxml2) would also figure out the encoding.
I have a program that uses LWP and decoded_content, but then I re-encode the
content before passing it on to the next tool in the chain, which will also
decode it. I've also considered parsing the content, removing any
content-specified charsets, and returning UTF-8 at all times.

> My complaint about LWP::Simple was that the HTTP Content-Type
> (charset) information is lost by the time it gets to the caller. If
> the data isn't in text at that point, it will never reliably get
> there. But for HTML and XML, if the character encoding is actually
> specified in the content rather than in the HTTP headers, then it
> isn't as important to deal with it up front. I could see a case
> then for dealing with text/* only and returning octets for
> everything else, since text/* is the only media type that has
> character encoding details in the HTTP headers. That being said,
> applications based on LWP::Simple are likely to work better with
> HTML and XML "assistance" for the reason I gave earlier: users of
> LWP::Simple probably aren't going to take the time to do the proper
> parsing and decoding. Yes, it's still "their fault" for not coding
> a robust application, but helping them do that is I think still a
> valid goal, if we can do it safely.

I'd tend to agree. Make LWP::Simple return decoded content, and if you need
more control, don't use LWP::Simple.

-- 
Bill Moseley
[EMAIL PROTECTED]
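For anyone following along, the decoding order being discussed (HTTP header
charset first, then an in-content declaration, then a fallback) can be
sketched like this. It's in Python purely for illustration since the thread
is about Perl's LWP; decode_body is a hypothetical helper, not any real LWP
or stdlib API, and is only a rough approximation of what decoded_content does:

```python
import re

def decode_body(octets: bytes, content_type: str) -> str:
    """Decode an HTTP response body: prefer the charset from the
    Content-Type header, fall back to an HTML <meta> declaration,
    then to ISO-8859-1 (a rough sketch, not LWP's actual logic)."""
    # 1. Charset from the HTTP Content-Type header, if present.
    m = re.search(r'charset=["\']?([\w.-]+)', content_type, re.I)
    if m:
        return octets.decode(m.group(1))
    # 2. Otherwise sniff a <meta ... charset=...> near the start of the body,
    #    which is the "dipping into the HTML" step debated above.
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', octets[:1024], re.I)
    if m:
        return octets.decode(m.group(1).decode('ascii'))
    # 3. Fall back to ISO-8859-1, HTTP's historical default for text/*.
    return octets.decode('iso-8859-1')
```

A caller that wanted octets back (for the "return decoded content and
re-encode" workflow described above) would encode the result again before
handing it to the next tool in the chain.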