On 25/04/06, Karsten Ohme <[EMAIL PROTECTED]> wrote:
> Ludovic Rousseau wrote:
> > I don't know why MS chose to use UTF-16 instead of UTF-8. UTF-8 is
> > backward compatible with ASCII so (very) easy to migrate to.
>
> For most languages this would cause trouble. E.g. Asian languages use
> two bytes. So independent of the locale the programmer can allocate two
> bytes (actually a TCHAR) per character (if UNICODE is defined). With
> UTF-8 you must parse the string (get the string length) to get the real
> physical size of the string, because ASCII is coded on the seven lower
> bits and the MSB indicates whether another byte follows to complete the
> character. I assume this is one reason: UTF-16 seems to be simpler.

You also need to parse the string to get its real length, even with
UTF-16. According to [1], some Unicode characters (those outside the
Basic Multilingual Plane) are encoded as a surrogate pair, i.e. on
4 bytes. So just dividing the array length in bytes by 2 to get the
string length may not work. If you want a simple transformation you
should use UTF-32, since any Unicode code point is represented on
exactly 32 bits. But I am not a Unicode expert.
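
To illustrate the point, here is a minimal sketch in Python (my choice of
language for brevity, not something from this thread). The helper name
utf16_code_points is hypothetical, and it assumes well-formed little-endian
UTF-16 without a BOM. It compares the naive "bytes divided by 2" count
with a count that skips the trailing half of each surrogate pair:

```python
def utf16_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-16LE data (no BOM).

    Hypothetical helper for illustration only.
    """
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        # A low (trailing) surrogate completes a pair whose high
        # surrogate was already counted, so skip it.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


# 'a', the euro sign, and U+1D11E (a character outside the BMP).
text = "a\u20ac\U0001D11E"
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

naive_length = len(utf16) // 2          # counts 16-bit code units
real_length = utf16_code_points(utf16)  # counts code points
utf32_length = len(utf32) // 4          # fixed width, so division is enough

print(naive_length, real_length, utf32_length)
```

The naive count and the real count differ as soon as the string contains
one character that needs a surrogate pair, while the UTF-32 division
always matches the number of code points.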

GTK+ 2.x uses UTF-8 only and provides a function g_utf8_strlen [2] to
get the string length in characters.
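
The idea behind such a function can be sketched as follows. This is not
GLib's implementation, only an illustration in Python with a hypothetical
helper name (utf8_char_count), and it assumes valid UTF-8 input. Every
character starts with exactly one byte that is not a continuation byte
(continuation bytes have the bit pattern 10xxxxxx), so counting the
non-continuation bytes gives the number of characters:

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Hypothetical helper for illustration only; continuation bytes
    (10xxxxxx) are skipped, every other byte starts a character.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# 'i with diaeresis' takes 2 bytes and the euro sign takes 3 bytes.
sample = "na\u00efve \u20ac".encode("utf-8")

byte_length = len(sample)              # physical size in bytes
char_length = utf8_char_count(sample)  # logical length in characters

print(byte_length, char_length)
```

So with UTF-8, as with UTF-16, the physical size and the number of
characters are two different values, and only the second one needs a scan.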

> Java also uses UTF-16.

Maybe not a good example? :-)

Thanks

[1] http://en.wikipedia.org/wiki/UTF-16
[2] http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strlen

--
  Dr Ludovic Rousseau

_______________________________________________
Muscle mailing list
[email protected]
http://lists.drizzle.com/mailman/listinfo/muscle
