Re: [Demexp-dev] Character encoding

David MENTRE Sun, 21 Oct 2007 23:57:56 -0700

Hello Lyu,

2007/10/22, Lyu Abe <[EMAIL PROTECTED]>:
> There's one thing I do not understand in character coding of the
> server's reply. When I display, for example, tag sets, I can read this:
>
> 'a_tag_label': u'citoyennet\xe9'
>
> in which  " u'citoyennet\xe9' " corresponds to an unicode encoded text,
> right?


Yes.

> Then I do not understand why we get unicode encoded strings,
> while DEMEXP is supposed to have UTF-8 encoding...

"UTF-8 is the byte-oriented encoding form of Unicode."
http://www.unicode.org/faq/utf_bom.html#2

In other words, all strings on the server are stored as UTF-8, which
is a byte encoding of Unicode. All exchanges between the server and
the clients are also done in UTF-8, a byte-level convention for
representing Unicode characters.
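
To make the distinction concrete, here is a minimal Python sketch (an
illustration only, not taken from the demexp client code). It uses the
label from your example and shows that the Unicode string and its
UTF-8 byte form are two representations of the same text. The variable
names are made up for the example.

```python
# The label as displayed by the Python client: a Unicode string.
label = u'citoyennet\xe9'

# The same text as UTF-8 bytes, the form used between server and clients.
utf8_bytes = label.encode('utf-8')

# 11 characters, but 12 bytes: the accented "e" (U+00E9) takes two bytes in UTF-8.
n_chars = len(label)
n_bytes = len(utf8_bytes)

# Decoding the bytes gives back the original Unicode string.
round_trip = utf8_bytes.decode('utf-8')
```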

After that, each platform is free to do any appropriate conversion,
e.g. to a 16- or 32-bit character encoding if it wishes. However, you
should take care to set the default Python encoding to UTF-8 when you
talk to the server.
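
As a hedged sketch of what this can look like on the client side, one
option is to convert explicitly at the boundary instead of relying on
the interpreter-wide default. The helper names to_wire and from_wire
are hypothetical; they are not part of demexp or of any library.

```python
def to_wire(text):
    """Encode a Unicode string to UTF-8 bytes before sending it (hypothetical helper)."""
    return text.encode('utf-8')


def from_wire(data):
    """Decode UTF-8 bytes received from the server (hypothetical helper).

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return data.decode('utf-8')


# Example: a tag label survives the round trip unchanged.
sent = to_wire(u'citoyennet\xe9')
received = from_wire(sent)
```

With explicit calls like these, the client code does not depend on how
the Python installation happens to be configured.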

To be honest, right now the server does not really check this encoding.
The UTF-8 convention mainly came from the GTK2 interface, which
produces UTF-8 strings. :-) But that check should be added at some
point.
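
For illustration only, such a check could be as simple as trying to
decode the incoming bytes. This is a Python sketch of the idea, not
the server's code; is_valid_utf8 is a hypothetical name.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Well-formed UTF-8 for "citoyennete" with an accented final e.
good = is_valid_utf8(b'citoyennet\xc3\xa9')

# The same label in Latin-1: a lone 0xE9 byte is not valid UTF-8.
bad = is_valid_utf8(b'citoyennet\xe9')
```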

Best wishes,
d.


_______________________________________________
Demexp-dev mailing list
Demexp-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/demexp-dev
