Jimmy Kaplowitz writes:
> based on looking at man pages, you can use one of three
> functions (mbstowcs, mbsrtowcs, or mbsnrtowcs) to convert your multibyte
> string to a wide character string (an array of type wchar_t, one wchar_t
> per *character*), and then use the many wcs* functions to do various
> tests. My recollection of the consensus on this list is that for
> internal purposes, wchar_t is the way to go, and conversion to multibyte
> strings of char is necessary only for I/O, and there only when you can't
> use functions like fwprintf.
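
Concretely, the approach described above looks roughly like the following
sketch (my code, not Jimmy's): measure with mbstowcs(), convert, then use
the wcs* functions. It assumes the input is valid in the current locale's
encoding, here a UTF-8 locale.

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "");

    const char *mb = "caf\xc3\xa9";        /* "café", assuming a UTF-8 locale */
    size_t n = mbstowcs (NULL, mb, 0);     /* count the wide characters */
    if (n == (size_t) -1)
      return 1;                            /* invalid multibyte sequence */

    wchar_t *wc = malloc ((n + 1) * sizeof (wchar_t));
    if (wc == NULL)
      return 1;
    mbstowcs (wc, mb, n + 1);
    wprintf (L"%zu wide characters\n", wcslen (wc));   /* prints 4 */
    free (wc);
    return 0;
  }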

That was my impression at the beginning as well, until I realized that
this approach only leads to unreliable programs. fgetwc, which you
would like to use for input, gives you no chance of recovery when it
encounters an invalid multibyte sequence in the input file. And the
output side of the streams is no better: calling fputwc on a stream on
which someone has already called fputc is undefined behaviour (it can
crash or silently do nothing).
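
Here is a minimal sketch (mine, not from the original post) of the fgetwc
problem: on an invalid multibyte sequence it returns WEOF with errno set
to EILSEQ, and the wide-character stdio API gives no portable way to skip
the offending byte and continue.

  #include <errno.h>
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "");          /* e.g. a UTF-8 locale */

    wint_t wc;
    errno = 0;
    while ((wc = fgetwc (stdin)) != WEOF)
      {
        /* process wc ... */
      }
    if (ferror (stdin) && errno == EILSEQ)
      {
        /* The stream is now wide-oriented, so falling back to fgetc()
           to skip the bad byte would itself be undefined behaviour. */
        fprintf (stderr, "invalid multibyte sequence in input\n");
        return 1;
      }
    return 0;
  }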

For an example, take the 'rev' program from util-linux and feed it
ISO-8859-1 input while running in a UTF-8 locale. Simply unreliable.

Also, wchar_t[] occupies more memory. More memory means more cache
misses, which means less speed.

Also, wchar_t[] doesn't fulfill its promise of "1 character = 1 memory
unit", because a Vietnamese character is usually composed of two
Unicode characters; the term "complex character" is used to denote
such a multi-wchar_t unit. And you must not separate these two units,
whether in truncation, regexp search, line breaking, or any other
algorithm.
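
A minimal sketch (mine) of such a complex character: one Vietnamese
letter stored as two wchar_t elements. It assumes __STDC_ISO_10646__,
i.e. wchar_t values are Unicode code points, as in glibc.

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "");

    /* "ế" written as ê (U+00EA) followed by a combining acute (U+0301). */
    wchar_t s[] = { 0x00EA, 0x0301, L'\0' };

    wprintf (L"wcslen = %zu\n", wcslen (s));   /* prints 2 for 1 visible letter */

    /* Truncating after the first element keeps the base letter but
       silently drops the tone mark. */
    s[1] = L'\0';
    return 0;
  }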

For this reason, wchar_t is only good for calling the <wctype.h> libc
APIs, not for the in-memory representation of strings. The latter
should still be done with char*. And for iterating through the
characters of multibyte strings, you can use the inline functions
found at

  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbchar.h?rev=1.3&content-type=text/vnd.viewcvs-markup
  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbiter_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup
  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gnu-i18n/gnu-i18n/fileutils-i18n/lib/mbfile_multi.h?rev=1.3&content-type=text/vnd.viewcvs-markup
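
Those headers wrap the standard mbrtowc() machinery; a bare-bones version
of such a character loop, without their conveniences (my own sketch, not
their code), looks roughly like this:

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>
  #include <wctype.h>

  int main (void)
  {
    setlocale (LC_ALL, "");

    const char *s = "abc\xc3\xa9";       /* "abcé", assuming a UTF-8 locale */
    size_t len = strlen (s);
    mbstate_t state;
    memset (&state, 0, sizeof state);

    for (size_t i = 0; i < len; )
      {
        wchar_t wc;
        size_t n = mbrtowc (&wc, s + i, len - i, &state);
        if (n == (size_t) -1 || n == (size_t) -2)
          {
            /* Invalid or incomplete sequence: treat one byte as a unit
               and resynchronize, the recovery that fgetwc denies you. */
            memset (&state, 0, sizeof state);
            i++;
            continue;
          }
        if (n == 0)                      /* embedded null byte */
          n = 1;
        printf ("character of %zu byte(s), iswalpha=%d\n",
                n, iswalpha ((wint_t) wc) != 0);
        i += n;
      }
    return 0;
  }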

> However, wchar_t is only guaranteed to be Unicode (which encoding?) 
> when the macro __STDC_ISO_10646__ is defined, as is done with glibc 2.2.

Correct. But that does not mean that *every* Unicode character can be
used: you cannot use Hangul characters in an ISO-8859-1 locale. In
glibc the <wctype.h> functions work on these characters (in any locale
except the "C" locale), but when you convert a Hangul character to
multibyte in such a locale, all you get is a '?'.
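
A sketch (my own) of how to detect that situation with the standard API:
wcrtomb() fails with EILSEQ for a wide character that the locale's
encoding cannot represent, so a program can at least choose its own
fallback. The locale name below is an assumption and must be installed
on the system.

  #include <errno.h>
  #include <limits.h>
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  int main (void)
  {
    /* Hypothetical locale name; adjust to whatever ISO-8859-1 locale
       is installed on your system. */
    if (setlocale (LC_ALL, "en_US.ISO-8859-1") == NULL)
      return 2;

    wchar_t hangul = 0xAC00;             /* U+AC00 HANGUL SYLLABLE GA */
    char buf[MB_LEN_MAX];
    mbstate_t state;
    memset (&state, 0, sizeof state);

    size_t n = wcrtomb (buf, hangul, &state);
    if (n == (size_t) -1 && errno == EILSEQ)
      printf ("not representable in this locale\n");
    else
      printf ("encoded in %zu byte(s)\n", n);
    return 0;
  }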

Bruno
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
