Hi,

At Wed, 03 Oct 2001 15:45:31 -0400,
Richard, Francois M <[EMAIL PROTECTED]> wrote:

> But, is it also true to say that under Linux utf-8 Locales, all C functions
> handle properly char data representing utf-8 character encoded data? Do
> strlen, strchr, strcmp, strcpy, toupper process char data correctly when the
> Locale character encoding is utf-8? OR I need to use the wide character
> functions after specific conversion from char to wchar_t of my character
> data?

Not perfectly.

* strlen
  strlen counts the *number of bytes* of the given string, not the
  *number of characters* of the string.  Since UTF-8 is a multibyte
  encoding, these two do not coincide.

* strcpy
  works well.

* strchr
  does not work for non-ASCII characters, because a multibyte UTF-8
  character cannot be expressed with the 'char' type.
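
To illustrate the strlen and strchr points, here is a minimal sketch.
The helper name utf8_count_chars is hypothetical, and it assumes its
input is valid UTF-8: it counts characters by skipping continuation
bytes (those with the bit pattern 10xxxxxx).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the characters (code points) in a string
   that is assumed to be valid UTF-8.  Continuation bytes have the bit
   pattern 10xxxxxx and are not counted. */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the six-byte string "na" "\xC3\xAF" "ve" (five characters, the third
being the two bytes C3 AF), strlen returns 6 while utf8_count_chars
returns 5.  strchr can still locate the ASCII 'v', but the two-byte
character can only be searched for as a byte sequence, for example with
strstr(s, "\xC3\xAF").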

I think the simplest way to replace all these functions is to use
wide characters.  The standard C library has wchar_t counterparts of
the above functions, and there are conversion functions between
"multibyte character" and "wide character".  Note that "multibyte
character" does not mean the character is always multibyte.  It means
"locale-dependent encoding".  In an ISO-8859-1 locale, "multibyte
character" is ISO-8859-1.  In a Big5 locale, "multibyte character" is
Big5.  I.e., if you write your software using "multibyte character" and
"wide character", your software will support not only UTF-8 but also
all major encodings in the world such as ISO-8859-*, EUC-*, KOI8-*,
and so on.

An explanation of the wchar_t functions is available in my document,
which is linked from my signature at the bottom of this mail.

Note that wchar_t is not always UTF-32, though it always is in
GNU libc.  If you have to write portable software, you must not assume
that wchar_t is UTF-32.
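
One way to check this assumption at compile time, offered only as a
sketch, is the standard C99 macro __STDC_ISO_10646__, which an
implementation defines when wchar_t values are ISO 10646 code points.
The function name wchar_is_iso10646 is hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t values are ISO 10646 (Unicode) code points, and 0 if it
   makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Portable code that needs Unicode code points can fall back to its own
conversion when this returns 0.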

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
