Re: Unicode support under Linux

Glenn Maynard Wed, 03 Oct 2001 20:28:57 -0700

> * strchr
>   does not works at all, because UTF-8 character cannot be expressed
>   with 'char' type.


From my understanding of UTF-8, strchr() should work fine to search for
7-bit characters, and strstr()/strrstr() for searching for arbitrary
Unicode characters (searching for their UTF-8 encodings.)
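A minimal sketch of that idea, assuming valid, NUL-terminated UTF-8.
The helper names utf8_find_ascii() and utf8_find_char() are made up
for illustration; they are thin wrappers over the standard strchr()
and strstr().  This works because bytes below 0x80 never occur inside
a multibyte sequence, and the encoding of one character never matches
at a misaligned position inside another.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find a 7-bit (ASCII) character in a UTF-8 string.
   Plain strchr() is enough: a byte below 0x80 is always a whole
   character, never part of a multibyte sequence. */
static const char *utf8_find_ascii(const char *s, char c)
{
    assert(s != NULL && (unsigned char)c < 0x80);
    return strchr(s, c);
}

/* Find an arbitrary character, passed as its own UTF-8 encoding
   (a short NUL-terminated byte string).  Plain strstr() is enough,
   assuming both strings are valid UTF-8. */
static const char *utf8_find_char(const char *s, const char *encoded)
{
    assert(s != NULL && encoded != NULL);
    return strstr(s, encoded);
}
```

For example, utf8_find_char(text, "\xc3\xa9") looks for U+00E9 and
returns a pointer to its lead byte, or NULL if it is absent.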

Major speed hit, of course, but there's no way to fix that within UTF-8
(and even using wchar_t is a major speed hit, just due to memory usage.)

> I think the simplest way to substitute all these functions is to use
> wide character.  Standard C library has wchar_t substitution of above
> functions.  And, these are conversion functions between "multibyte
> character" and "wide character".  Note that "multibyte character" does
> not mean the character is always multibyte.  It is "locale-dependent
> encoding".  This means that, in ISO-8859-1 locale, "multibyte character"
> is ISO-8859-1.  In Big5 locale, "multibyte character" is Big5.  I.e.,
> if you write your software using "multibyte character" and "wide
> character",

Well, that means either a major memory hit for string-intensive
programs (using wchar_t exclusively internally) or a lot of conversion
(using multibyte internally); both imply a speed hit beyond what you'd
expect.  Both imply a lot of converting (the first, whenever you read
or write files, handle filenames, etc.; the second, every time you
call a wide C function.)  Using both WC and MB internally is rather
annoying, too (nobody sane wants to deal with more than one string
type.)

All C string functions can be implemented easily for UTF-8; the only
hard part is doing it efficiently, and without converting the whole
thing to wchar_t first.  Some functions are straightforward to
implement reasonably fast, but you're always stuck with the UTF-8
decoding logic ...
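As a rough illustration of both cases, here is a sketch with two
hypothetical helpers (the names are not from any library).
utf8_strlen() counts characters without decoding anything, while
utf8_next() shows the decoding logic you can't avoid when you need
the actual code point.  Both assume the input is already valid UTF-8
and do no error checking.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points by counting every byte that is not a
   continuation byte (10xxxxxx).  No decoding needed. */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Decode the code point at *p and advance *p past it.
   Assumes valid UTF-8; a real version would validate. */
static unsigned long utf8_next(const char **p)
{
    const unsigned char *s;
    unsigned long cp;
    int extra;

    assert(p != NULL && *p != NULL);
    s = (const unsigned char *)*p;

    /* The lead byte gives the sequence length and the top bits. */
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;

    /* Each continuation byte contributes six more bits. */
    while (extra-- > 0)
        cp = (cp << 6) | (unsigned long)(*s++ & 0x3F);

    *p = (const char *)s;
    return cp;
}
```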

Is gdb smart enough yet to convert wchar_t * to the locale's encoding
when displaying strings?  I doubt it; this probably makes using
wchar_t internally harder to debug.

It also implies not taking advantage of some of the better aspects of
UTF-8, like being able to do a strrchr() and strrstr() without having
to scan from the beginning of the string.
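A sketch of what scanning backward can look like, assuming valid
UTF-8 and a caller that already knows the byte length.  Both
utf8_prev() and utf8_rfind() are hypothetical names, not existing
library functions.  Continuation bytes are always 10xxxxxx, so a
character boundary can be found from any position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Step back from p to the first byte of the preceding character.
   start is the beginning of the buffer; p must be past it. */
static const char *utf8_prev(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p > start);
    do {
        p--;
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}

/* Last occurrence of needle in the first hlen bytes of hay, or NULL.
   Scans from the end; a match can only begin on a character
   boundary when both strings are valid UTF-8. */
static const char *utf8_rfind(const char *hay, size_t hlen,
                              const char *needle)
{
    size_t nlen;
    const char *p;

    assert(hay != NULL && needle != NULL);
    nlen = strlen(needle);
    if (nlen > hlen)
        return NULL;

    for (p = hay + (hlen - nlen); ; p--) {
        if (memcmp(p, needle, nlen) == 0)
            return p;
        if (p == hay)
            break;
    }
    return NULL;
}
```

With a legacy multibyte encoding where a trail byte can look like a
lead byte, the same backward scan could report false matches, which
is why those need a pass from the start of the string.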

Of course, supporting arbitrary encodings is nice, but I wouldn't
want to complicate a program too badly for it.  (That's from my "don't
go out of your way to support obsolete software" perspective, of course--
all those other annoying encodings being the obsolete software--but 
it's not always that simple.)

-- 
Glenn Maynard
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
