Re: Charset converting

Paul P Komkoff Jr Wed, 09 Mar 2005 11:53:02 -0800

Replying to Yury.Mikhienko:
> fix that after rollback the 
> http://www.kannel.org/cgi-bin/viewcvs.cgi/gateway/gwlib/http.c.diff?r1=1.225&r2=1.226&sortby=date
>  
> patch
> now log is:
> 2005-03-09 17:55:04 [22126] [7] DEBUG: WML compiler: Charset is <>
> 2005-03-09 17:55:04 [22126] [7] DEBUG: WML compiler: Encoding is <UTF-8> 
> -- may be reason in that?


Yes.
Here in Russia there are a lot of content-provider servers that send
no explicit charset in the Content-type: header but supply the body
in UTF-8. Or the charset is specified in the XML preamble instead.

In the patched version I use, I just commented out that fragment and
assume charset = UTF-8 in such cases.

Actually there are many, many corner cases, and our content providers
manage to hit them all. For example, they can run apache-rus with
encoding conversion on, which does funny things with the content,
while the <?xml encoding= ?> preamble is obviously left untouched. In
this case we should trust the HTTP headers. But some other provider
thinks he can add Content-type: ...; charset=ISO8859-5 or whatever and
then load a bunch of WMLs with a different encoding= in the XML
preamble on ...
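To make the first corner case concrete, here is a minimal Python sketch
of a server-side recode that leaves the preamble stale. The charsets
(windows-1251 on disk, koi8-r on the wire), the sample text, and the
server_recode helper are assumptions chosen for illustration; this is
not apache-rus code.

```python
# Hypothetical illustration: the server recodes the body bytes but the
# ASCII-only XML preamble passes through unchanged, so it now lies.

def server_recode(body: bytes, src: str, dst: str) -> bytes:
    """Stand-in for a recoding web server: convert bytes from src to dst."""
    return body.decode(src).encode(dst)

doc = '<?xml version="1.0" encoding="windows-1251"?><p>Привет</p>'

on_disk = doc.encode("cp1251")                       # what the provider wrote
on_wire = server_recode(on_disk, "cp1251", "koi8-r")  # what the gateway gets

# The preamble still claims windows-1251 although the bytes are koi8-r.
stale_preamble = b'encoding="windows-1251"' in on_wire

trusting_preamble = on_wire.decode("cp1251")  # decodes, but text is garbage
trusting_header = on_wire.decode("koi8-r")    # matches the original text
```

Here only the charset announced in the HTTP header recovers the
original text, which is why the headers are the better source in this
particular case.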

So for now I'm sticking with this logic:
- Check whether the XML document contains encoding= in its preamble.
  If it does, assume charset == preamble value.
- If that check was negative, try to get the charset from the HTTP
  headers.
- If that was negative too, then charset = UTF-8.
- If charset != UTF-8, charset is not accepted by the device, and
  UTF-8 is accepted by the device, then recode the body from charset
  to UTF-8, strip the <?xml ... ?> preamble, and set charset = UTF-8.
- (Same for ISO8859-1.)
- Do the rest of the content processing.

This avoids most encoding glitches and double-encoding bugs.
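
The steps above can be sketched roughly as follows. This is a Python
illustration, not the actual patch to Kannel's C code: the function
names, the regular expressions, and the shape of the "accepted" list
are all assumptions made for the example.

```python
import codecs
import re

# Leading <?xml ... ?> declaration and the encoding= attribute inside it.
_XML_DECL = re.compile(rb"^\s*<\?xml[^>]*?\?>", re.DOTALL)
_XML_ENC = re.compile(rb"""encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
_CT_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._-]+)"?', re.IGNORECASE)


def canonical(name):
    """Map aliases such as UTF8 / utf-8 to one spelling for comparison."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def preamble_charset(body):
    """Step 1: encoding= from the XML preamble, or None."""
    decl = _XML_DECL.match(body)
    if not decl:
        return None
    enc = _XML_ENC.search(decl.group(0))
    return enc.group(1).decode("ascii") if enc else None


def header_charset(content_type):
    """Step 2: charset= from the Content-type header value, or None."""
    if not content_type:
        return None
    m = _CT_CHARSET.search(content_type)
    return m.group(1) if m else None


def resolve_charset(body, content_type):
    """Steps 1-3: preamble first, then HTTP header, then UTF-8."""
    return preamble_charset(body) or header_charset(content_type) or "UTF-8"


def normalize(body, content_type, accepted):
    """Steps 4-5: recode to UTF-8 (or ISO-8859-1) if the device needs it."""
    charset = resolve_charset(body, content_type)
    ok = {canonical(a) for a in accepted}
    if canonical(charset) in ok:
        return body, charset  # device takes it as-is
    for target in ("UTF-8", "ISO-8859-1"):
        if canonical(target) not in ok:
            continue
        try:
            # Strip the preamble so it cannot contradict the new charset.
            stripped = _XML_DECL.sub(b"", body, count=1)
            return stripped.decode(charset).encode(target), target
        except (LookupError, UnicodeError):
            continue  # unknown charset or unrepresentable text
    return body, charset  # nothing better available; pass through
```

For example, a body declaring encoding="windows-1251" that arrives
with charset=ISO8859-5 in the header is treated as windows-1251 and,
for a device that accepts only UTF-8, recoded to UTF-8 with the
preamble removed.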

Hope this helps.

-- 
Paul P 'Stingray' Komkoff Jr // http://stingr.net/key <- my pgp key
 This message represents the official view of the voices in my head
