Re: Getting the encoding of a response

Charles C. Fu Wed, 12 May 2004 01:34:45 -0700

In <[EMAIL PROTECTED]> on 11 May 2004,
   Louise M. Mitchell <Mitchell> wrote:
> I need to grab the encoding of pages I'm retrieving with
> LWP::UserAgent... my perusal of the documentation indicated I could use
> the LWP::MediaTypes to get the encoding...


No, LWP::MediaTypes is for guessing a media type (from a file name,
for instance) when other information is not present.

Look for charset info in $response->header('Content-Type').  If no
charset parameter is present there, the HTTP spec says the charset
defaults to ISO-8859-1, but the HTML 4.01 spec says not to assume any
default in that case.  Charset information in the Content-Type header
takes precedence over other possible sources of charset info, such as
an XML declaration or a <meta> tag.
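
As a rough sketch of the header check, something like the following
should work.  The helper name charset_from_content_type is made up for
illustration and is not part of LWP; it takes the header value as a
plain string and returns undef when there is no charset parameter, so
the caller can decide which default (if any) to apply.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): extract the charset parameter
# from a Content-Type header value.  Returns undef if none is present.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;

    # Accept both charset=foo and charset="foo"; parameter names are
    # case-insensitive.
    if ($content_type =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        my $charset = defined $1 ? $1 : $2;
        return lc $charset;    # charset names are case-insensitive too
    }
    return;
}
```

With a response object in hand, the call would be along the lines of
charset_from_content_type(scalar $response->header('Content-Type')).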

If you wish to examine the meta tags, you should use one of the HTML
parsers to parse the response content.
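
One possible way to do that is HTML::HeadParser, which reports
<meta http-equiv="..."> elements as if they were headers.  This is
only a sketch: the helper name meta_content_type is hypothetical, and
it only looks at the http-equiv form of the tag.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: return the Content-Type declared via
# <meta http-equiv="Content-Type" ...> in the document head,
# or undef if the head has no such tag.
sub meta_content_type {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);    # parsing stops once the head is finished
    return scalar $parser->header('Content-Type');
}
```

The returned string has the same form as the HTTP header value, so
any charset parameter can be pulled out of it in the same way.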

(Finally, if you care, there is also a spec on how to guess the
charset of documents where none has been specified.  The procedure
is basically to go through the response incrementally seeing what
charsets could legally contain all the content encountered so far
until only one remains or the end of the content has been reached.)
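
A much simplified illustration of that elimination idea, using the
Encode module: it checks the whole content at once instead of
incrementally, and the candidate list is whatever the caller supplies.
The helper name plausible_charsets is made up, and this is not an
implementation of the spec itself.  Note that single-byte charsets
such as ISO-8859-1 accept any byte sequence, so they are never
eliminated.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the candidates that can decode all of
# $bytes without hitting a malformed sequence.
sub plausible_charsets {
    my ($bytes, @candidates) = @_;
    my @survivors;
    for my $name (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $decoded = eval { Encode::decode($name, $copy, Encode::FB_CROAK) };
        push @survivors, $name if defined $decoded;
    }
    return @survivors;
}
```

For example, plausible_charsets($response->content, 'US-ASCII',
'UTF-8', 'ISO-8859-1') would narrow down those three candidates.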

-ccwf
-- 
Charles C. Fu                           ,--
Founder                ___  __ __. . ,-/--
Web i18n, LLC              (_,(_,|/|/ /
www.web-i18n.net                 ----'