Jimmy Kaplowitz writes:
> based on looking at man pages, you can use one of three
> functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte
> string to a wide character string (an array of type wchar_t, one wchar_t
> per *character*), and then use the many wcs* functions to do various
> tests. My recollection of the consensus on this list is that for
> internal purposes, wchar_t is the way to go, and conversion to multibyte
> strings of char is necessary only for I/O, and there only when you can't
> use functions like fwprintf.
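
For reference, the approach described above can be sketched roughly as
follows. This is only a minimal sketch, not code from any particular
program: the helper names to_wide and count_alpha are made up for
illustration, and it uses mbsrtowcs, one of the three functions mentioned.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string to a
   freshly allocated wide string, using the current locale's encoding.
   Returns NULL if the input contains an invalid multibyte sequence or
   if allocation fails. The caller frees the result. */
static wchar_t *to_wide(const char *mbs)
{
    const char *p = mbs;
    mbstate_t st;
    wchar_t *ws;
    size_t n;

    /* First pass: with a NULL destination, mbsrtowcs only counts the
       wide characters needed (excluding the terminator). */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    /* Second pass: do the actual conversion, terminator included. */
    p = mbs;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(ws, &p, n + 1, &st) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}

/* Hypothetical helper: one of the "various tests" done on the wide
   string, here counting alphabetic characters via <wctype.h>. */
static size_t count_alpha(const wchar_t *ws)
{
    size_t count = 0;

    for (; *ws != L'\0'; ws++)
        if (iswalpha((wint_t)*ws))
            count++;
    return count;
}
```

A caller would run setlocale(LC_ALL, "") first so that the conversion
follows the user's locale. Note that to_wide simply gives up with NULL on
the first invalid multibyte sequence.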

That was my impression at the beginning as well. Until I realized that
all this idea leads to is unreliable programs. fgetwc, which you would
like to use for input, doesn't give you any chance to recover when it
encounters an invalid multibyte character in the input file. And the
output side of the streams is no better: calling fputwc on a stream on
which someone has already done an fputc call is undefined behaviour (it
can crash or do nothing). For an example, take the 'rev' program in
util-linux, and feed it ISO-8859-1 input while running in a UTF-8
locale. Simply unreliable.

Also, wchar_t[] occupies more memory. More memory means more cache
misses, which means less speed.

Also, wchar_t[] doesn't fulfill its promise of "1 character = 1 memory
unit", because a Vietnamese character is usually composed of two Unicode
characters; the term "complex character" is used to denote this
multi-wchar_t unit. And you cannot separate these two units, whether in
truncation, regexp search, line breaking or any other algorithm.

For this reason, wchar_t is only good for calling the <wctype.h> libc
APIs, not for the in-memory representation of strings. The latter should
still be done with char*. And for iterating through characters in
multibyte strings, you can use the inline functions found at

  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbchar.h?rev=1.3&content-type=text/vnd.viewcvs-markup
  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbiter_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup
  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbfile_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup

> However, wchar_t is only guaranteed to be Unicode (which encoding?)
> when the macro __STDC_ISO_10646__ is defined, as is done with glibc 2.2.

Correct.
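
To make this concrete, here is a minimal sketch (not the code from the
headers linked above) that walks a char* string with mbrtowc, carries on
after an invalid byte, which is the recovery fgetwc does not offer, and
reports whether the resulting wchar_t values may be read as Unicode code
points. The names count_chars and wchar_is_unicode are invented for
illustration, and skipping exactly one byte after an error is an
assumption of this sketch, not the only possible policy.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the `len` bytes at `s`,
   interpreted in the current locale's encoding. Bytes that do not
   start a valid, complete multibyte character are skipped one at a
   time; *invalid receives the number of bytes skipped that way. */
static size_t count_chars(const char *s, size_t len, size_t *invalid)
{
    mbstate_t state;
    size_t chars = 0;
    size_t i = 0;

    memset(&state, 0, sizeof state);
    *invalid = 0;
    while (i < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + i, len - i, &state);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid (-1) or incomplete (-2) sequence. Reset the
               conversion state and resynchronize on the next byte. */
            memset(&state, 0, sizeof state);
            (*invalid)++;
            i++;
            continue;
        }
        if (n == 0)
            n = 1;          /* an embedded NUL occupies one byte */
        chars++;
        i += n;
    }
    return chars;
}

/* Returns 1 if wchar_t values are ISO 10646 (Unicode) code points on
   this implementation, 0 if that is not guaranteed. */
static int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

The string stays in its char* form throughout; wchar_t only appears one
character at a time, which is all the <wctype.h> functions need. Fed
ISO-8859-1 bytes in a UTF-8 locale, a loop of this kind would typically
report the offending bytes through *invalid and keep going.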

But the macro being defined does not mean that *every* Unicode character
can be used: you cannot use Hangul Unicode characters in an ISO-8859-1
locale. In glibc the <wctype.h> functions work on these characters (in
any locale except the "C" locale), but when you convert a Hangul
character to multibyte in such a locale, all you get is a '?'.

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
