On 25/04/06, Karsten Ohme <[EMAIL PROTECTED]> wrote:
> Ludovic Rousseau wrote:
> > I don't know why MS chose to use UTF-16 instead of UTF-8. UTF-8 is
> > backward compatible with ASCII so (very) easy to migrate to.
>
> For most languages this would cause trouble. E.g. Asian languages use
> two bytes. So, independent of the locale, the programmer can allocate
> two bytes (actually a TCHAR) per character (if UNICODE is defined).
> With UTF-8 you must parse the string (get the string length) to get
> the real physical size of the string, because ASCII is coded on the
> seven lower bits and the MSB signals that further bytes follow to make
> up a whole character. I assume this is one reason why UTF-16 seems to
> be simpler.

You also need to parse the string to get its real length with UTF-16.
According to [1], some Unicode characters are encoded on 4 bytes. So
just dividing the array length by 2 to get the string length may not
work.

If you want a simple transformation you should use UTF-32, since any
Unicode character can be represented in exactly 32 bits. But I am not
a Unicode expert.

GTK+ 2.x uses UTF-8 only and provides a function, g_utf8_strlen [2], to
get the string length.

> Java also uses UTF-16.

Maybe not a good example? :-)

Thanks

[1] http://en.wikipedia.org/wiki/UTF-16
[2] http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strlen

-- 
Dr Ludovic Rousseau
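
To illustrate the point about string lengths, here is a minimal sketch in Python. It is not taken from any of the libraries mentioned in this thread; the helper names (code_points_from_utf16, code_points_from_utf8) and the sample string are made up for the example. It shows that dividing the UTF-16 byte count by 2 over-counts as soon as a character outside the Basic Multilingual Plane appears, while UTF-32 really is 4 bytes per character.

```python
def code_points_from_utf16(data: bytes) -> int:
    """Count characters in little-endian UTF-16 data.

    Hypothetical helper: a character outside the BMP is stored as two
    16-bit units, so the second unit (0xDC00..0xDFFF) is not counted.
    """
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


def code_points_from_utf8(data: bytes) -> int:
    """Count characters in UTF-8 data.

    Hypothetical helper: continuation bytes have the form 10xxxxxx,
    so only the bytes that start a character are counted.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Sample string (an assumption for this example): 'a', e-acute,
# the euro sign, and U+1D11E (musical G clef, outside the BMP).
text = "a\u00e9\u20ac\U0001D11E"

utf8 = text.encode("utf-8")       # 1 + 2 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 2 + 4 bytes
utf32 = text.encode("utf-32-le")  # 4 bytes per character

naive_utf16_length = len(utf16) // 2   # counts 16-bit units, not characters
utf32_length = len(utf32) // 4         # exact, no parsing needed

print(len(text), naive_utf16_length,
      code_points_from_utf16(utf16),
      code_points_from_utf8(utf8),
      utf32_length)
```

For this sample string the naive UTF-16 division gives 5 while the other counts give 4, so both UTF-8 and UTF-16 need a pass over the data to get the character count.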
