Ludovic Rousseau wrote:
> On 25/04/06, Karsten Ohme <[EMAIL PROTECTED]> wrote:
>
>>Ludovic Rousseau wrote:
>>
>>>I don't know why MS chose to use UTF-16 instead of UTF-8. UTF-8 is
>>>backward compatible with ASCII so (very) easy to migrate to.
>>
>>For most languages this would cause trouble. E.g. Asian languages use
>>two bytes. So independent of the locale the programmer can allocate two
>>bytes (actually a TCHAR) (if UNICODE is defined). With UTF-8 you must
>>parse the string (get the string length) to get the real physical size
>>of the string, because ASCII is coded on the seven lower bits and the
>>MSB decides whether a next byte is needed to get a whole character. I
>>assume this is a reason why it seems to be simpler.
>
>
> You also need to parse the string to get its real length even
> with UTF-16. According to [1] you may code some Unicode characters on
> 4 bytes. So just dividing the array length by 2 to get the string
> length may not work. If you want a simple transformation you should
> use UTF-32 since any Unicode character can be represented on exactly
> 32 bits. But I am not a Unicode expert.
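To make the quoted point concrete, here is a minimal sketch in Python. It is not code from this thread or from any library: the helper names utf8_length, utf16_units and utf16_length are made up for illustration, and the input is assumed to be well-formed. It suggests that counting characters needs a pass over the data in both UTF-8 and UTF-16, while UTF-32 uses a fixed 4 bytes per code point.

```python
def utf8_length(data):
    """Count code points in well-formed UTF-8 bytes.

    Continuation bytes look like 10xxxxxx, so every other byte
    starts a new code point.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf16_units(text):
    """Return the 16-bit code units of text (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def utf16_length(units):
    """Count code points in well-formed UTF-16 code units.

    A character outside the BMP is stored as a surrogate pair;
    skipping the low surrogate (0xDC00..0xDFFF) counts the pair once.
    """
    return sum(1 for u in units if not (0xDC00 <= u <= 0xDFFF))


# 'a', e-acute, euro sign, and U+1D11E (musical G clef, outside the BMP)
sample = "a\u00e9\u20ac\U0001D11E"

utf8_bytes = sample.encode("utf-8")
units = utf16_units(sample)

print(len(utf8_bytes), utf8_length(utf8_bytes))   # 10 bytes, 4 code points
print(len(units), utf16_length(units))            # 5 code units, 4 code points
print(len(sample.encode("utf-32-le")))            # 16 bytes = 4 * 4
```

Dividing the UTF-16 byte count by 2 gives 5 here, not 4, because the last character needs a surrogate pair. The division is only exact for text that stays inside the BMP.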
Mmmh, if you allocate memory for a Unicode string, two bytes are always
used per character, so Microsoft uses UCS-2 (search Google for
"Microsoft UCS-2"). UCS-2 covers Plane 0, the Basic Multilingual Plane
(BMP), where those two bytes always map to one fixed code point.
Java uses UTF-16.

Karsten

>
> GTK+ 2.x uses UTF-8 only and provides a function g_utf8_strlen [2] to
> get the string length.
>
>
>>Java also uses UTF-16.
>
>
> Maybe not a good example? :-)
>
> Thanks
>
> [1] http://en.wikipedia.org/wiki/UTF-16
> [2] http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strlen
>
> --
> Dr Ludovic Rousseau
>
> _______________________________________________
> Muscle mailing list
> [email protected]
> http://lists.drizzle.com/mailman/listinfo/muscle
