Hi,

At Wed, 03 Oct 2001 15:45:31 -0400,
Richard, Francois M <[EMAIL PROTECTED]> wrote:
> But, is it also true to say that under Linux utf-8 Locales, all C functions
> handle properly char data representing utf-8 character encoded data? Do
> strlen, strchr, strcmp, strcpy, toupper process char data correctly when the
> Locale character encoding is utf-8? OR I need to use the wide character
> functions after specific conversion from char to wchar_t of my character
> data?

Not perfectly.

* strlen
  strlen counts the *number of bytes* in the given string, not
  the *number of characters*.  Since UTF-8 is a multibyte
  encoding, these two do not coincide in general.

* strcpy
  works well.

* strchr
  does not work in general, because a non-ASCII UTF-8 character
  cannot be expressed with the 'char' type.  (Searching for an
  ASCII character still works.)

I think the simplest way to replace all these functions is to use
wide characters.  The standard C library has wchar_t counterparts of
the above functions.  And there are conversion functions between
"multibyte character" and "wide character".

Note that "multibyte character" does not mean that the character
always takes multiple bytes.  It means "the locale-dependent encoding".
That is, in an ISO-8859-1 locale, "multibyte character" is ISO-8859-1.
In a Big5 locale, "multibyte character" is Big5.  I.e., if you write
your software using "multibyte character" and "wide character", your
software will support not only UTF-8 but also all major encodings
in the world, such as ISO-8859-*, EUC-*, KOI8-*, and so on.

An explanation of the wchar_t functions is given in my document, which
is linked from my signature at the bottom of this mail.

Note that wchar_t is not always UTF-32, though it always is in
GNU libc.  If you have to write portable software, you must not
assume wchar_t is UTF-32.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"
http://www.debian.org/doc/manuals/intro-i18n/
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
