Hi

I just want to remind you that if there is no charset specified in the
HTTP Content-Type header, this means the document is encoded using the
iso-8859-1 charset.

The RFC explicitly states that if no charset is indicated in the
Content-Type header, the recipient should not try to guess the charset;
the charset is in fact iso-8859-1.

Of course, "web applications" might not know what charset the webserver
would announce in the Content-Type header, so it makes sense to let this
be overridden in the document preamble. (I guess this is not strictly
legal with regard to the HTTP RFC...)

But once again, if no charset is indicated whatsoever, then the document
should be interpreted according to the iso-8859-1 charset, and NOT utf-8
(or whatever).
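
A minimal sketch of that default rule (my own illustration, not Kannel's
http.c; the function name is made up):

```python
# Sketch: pick the charset from an HTTP Content-Type header value,
# defaulting to iso-8859-1 when no charset parameter is present,
# as HTTP/1.1 specifies for text/* media types.
def charset_from_content_type(content_type):
    # Parameters follow the media type, separated by ";".
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip('"').lower()
    return "iso-8859-1"  # the HTTP default, not utf-8

charset_from_content_type("text/vnd.wap.wml")                 # iso-8859-1
charset_from_content_type("text/vnd.wap.wml; charset=UTF-8")  # utf-8
```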

Here in Norway we have a lot of servers not indicating any charset at all,
but they usually do send out content encoded with the iso-8859-1 charset,
and are thus RFC compliant.

These sites did not work with Kannel earlier, so I sent a patch some
time ago to fix this. It should now be in CVS.

It wouldn't do much good for speed, but if conversion fails using the
implied iso-8859-1 charset, one could perhaps try another conversion based
on a "qualified guess" at the charset? In this way Kannel would be RFC
compliant as far as the documents are, and still be able to process
non-RFC-compliant documents with meaningful content.
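
One wrinkle with such a fallback (this sketch is my own illustration, not
the patch): decoding as iso-8859-1 can never fail, since every byte is a
valid iso-8859-1 character. So in practice the "qualified guess" has to
validate the stricter encoding first and keep iso-8859-1 as the fallback
that cannot fail:

```python
# Sketch of a "qualified guess" fallback chain: utf-8 decoding is strict
# and rejects invalid byte sequences, so try it first; iso-8859-1 accepts
# any byte and serves as the last resort.
def guess_decode(body):
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return body.decode("iso-8859-1"), "iso-8859-1"

guess_decode(b"S\xe6tre")   # ("Sætre", "iso-8859-1")
```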

Regards
Rune

---
Rune Sætre <[EMAIL PROTECTED]>
NetCom as, Infrastruktur
Telefon (mob): 934 34 285

On Wed, 9 Mar 2005, Paul P Komkoff Jr wrote:

> Replying to Yury.Mikhienko:
> > fix that after rollback the
> > http://www.kannel.org/cgi-bin/viewcvs.cgi/gateway/gwlib/http.c.diff?r1=1.225&r2=1.226&sortby=date
> > patch
> > now log is:
> > 2005-03-09 17:55:04 [22126] [7] DEBUG: WML compiler: Charset is <>
> > 2005-03-09 17:55:04 [22126] [7] DEBUG: WML compiler: Encoding is <UTF-8>
> > -- may be reason in that?
>
> Yes.
> Here in Russia there are a lot of content-provider servers without an
> explicit charset specified in the Content-Type: header, but with the body
> supplied in UTF-8. Or the charset is specified in the preamble.
>
> In the patched version I use, I just commented out that fragment and
> assume charset = UTF-8 in such cases.
>
> Actually there are many, many corner cases, and our content providers
> manage to hit them all. For example, they can run apache-rus with
> encoding on, which will do funny things with the content - but the <?xml
> encoding= ?> preamble is obviously left untouched. In this case we should
> trust the HTTP headers. But some other provider thinks that he can add
> Content-type: ...; charset=ISO8859-5 or whatever and then load a bunch
> of wmls with different encoding= values in the xml preamble on ...
>
> So for now I'm sticking with this logic:
> - Check if the xml document contains encoding= in the preamble. If it
>   does, assume charset == preamble value
> - If the previous check was negative, try to get the charset from the
>   HTTP headers
> - If that was also negative, then charset = UTF-8
> - If charset != UTF-8 and the charset is not accepted by the device, but
>   UTF-8 is accepted by the device, then recode the body from charset to
>   UTF-8, strip the <?xml ... ?> preamble, and set charset = UTF-8
> - (same for ISO8859-1)
> - do the rest of the content processing.
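
[Editor's note: the charset-precedence part of the quoted steps can be
sketched as follows. This is my reading of the logic, not the actual
patched C code; the function name and regex are illustrative only.]

```python
import re

# Precedence sketch: xml preamble encoding= wins, then the HTTP header
# charset, then UTF-8 as the final assumption. Recoding the body to a
# device-accepted charset (the remaining steps) would follow from here.
def effective_charset(body, http_charset):
    m = re.match(rb'\s*<\?xml[^>]*\bencoding=["\']([^"\']+)["\']', body)
    if m:                      # 1. preamble declares an encoding
        return m.group(1).decode("ascii").lower()
    if http_charset:           # 2. fall back to Content-Type charset
        return http_charset.lower()
    return "utf-8"             # 3. last resort
```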
>
> This avoids most encoding glitches and double-encoding bugs.
>
> Hope this helps.
>
> --
> Paul P 'Stingray' Komkoff Jr // http://stingr.net/key <- my pgp key
>  This message represents the official view of the voices in my head
>
>
