> * strchr
>   does not works at all, because UTF-8 character cannot be expressed
>   with 'char' type.

From my understanding of UTF-8, strchr() should work fine to search for
7-bit characters, and strstr()/strrstr() for searching for arbitrary
Unicode characters (passed in as their UTF-8 byte sequences). Major speed
hit, of course, but there's no way to fix that within UTF-8 (and even
using wchar_t is a major speed hit, just due to memory usage).

> I think the simplest way to substitute all these functions is to use
> wide character.  Standard C library has wchar_t substitution of above
> functions.  And, these are conversion functions between "multibyte
> character" and "wide character".  Note that "multibyte character" does
> not mean the character is always multibyte.  It is "locale-dependent
> encoding".  This means that, in ISO-8859-1 locale, "multibyte character"
> is ISO-8859-1.  In Big5 locale, "multibyte character" is Big5.  I.e.,
> if you write your software using "multibyte character" and "wide
> character",

Well, that means either a major memory hit for string-intensive programs
(using wchar_t exclusively internally) or a lot of conversion (using
multibyte internally); both imply a speed hit (above the expected). Both
imply a lot of converting (the first, whenever you read or write to disk,
files, for filenames, etc.; the second, every time you call a wide C
function). Using both WC and MB internally is rather annoying, too
(nobody sane wants to deal with more than one string type).

All C string functions can be implemented easily for UTF-8; the only hard
part is doing it efficiently, and without converting the whole thing to
wchar_t first. Some functions are straightforward to implement reasonably
fast, but you're always stuck with the UTF-8 decoding logic ...

Is gdb yet smart enough to convert wchar_t * to the locale when displaying
strings? I doubt it; this probably makes using wchar_t internally harder
to debug.
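To make the strchr()/strstr() point concrete, here is a rough sketch of
what I mean. It assumes the input is valid UTF-8; the helper names
utf8_length() and utf8_find_char() are made up for illustration and are
not from any library. A byte-wise strstr() is enough to find a whole
character because lead bytes and continuation bytes never share values,
so a match cannot begin in the middle of another character. Plain
strchr() needs no wrapper at all for 7-bit characters, since bytes below
0x80 only ever appear as themselves in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a
 * character, so no full decoding is needed. */
static size_t utf8_length(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: find one character, given as its UTF-8 byte
 * sequence, inside a UTF-8 string. This is just strstr(): a complete
 * encoded character can only match at a character boundary. */
static const char *utf8_find_char(const char *haystack, const char *utf8_char)
{
    assert(haystack != NULL && utf8_char != NULL);
    return strstr(haystack, utf8_char);
}
```

With a string like "na\xC3\xAFve/caf\xC3\xA9" (a made-up sample),
strchr(s, '/') finds the slash directly, utf8_find_char(s, "\xC3\xA9")
finds the final character, and utf8_length() reports fewer characters
than strlen() reports bytes. The results are byte offsets, not character
indexes, which is usually what you want for slicing anyway.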

Using wchar_t internally also means not taking advantage of some of the
better aspects of UTF-8, like being able to do a strrchr() and strrstr()
without having to scan from the beginning of the string.

Of course, supporting arbitrary encodings is nice, but I wouldn't want to
complicate a program too badly for it. (That's from my "don't go out of
your way to support obsolete software" perspective, of course, with all
those other annoying encodings being the obsolete software, but it's not
always that simple.)

-- 
Glenn Maynard
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
