Re: Unicode vs. wchar and MultiByte

Glenn Maynard Thu, 13 Dec 2001 16:12:06 -0800

On Thu, Dec 13, 2001 at 05:55:48PM -0100, Bengt Johansson wrote:
> I have an application that internally uses Unicode. The application


Do you mean UTF-8?  (And does it really use Unicode internally, or does
it use the locale encoding?  If the former, you need to be careful to
convert as necessary whenever you do any sort of I/O, so the latter is
usually better.)

> I don't know much about this, so I was hoping that the wide character
> stuff would work with my Unicode strings - and in fact it did, with a
> German installation of Linux, but on an English installation it stopped
> working.

Are you setting the locale correctly, i.e. calling setlocale(LC_ALL, "")
at startup?
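
As a rough sketch of what "setting the locale" means here: a C program
starts in the "C" locale and only picks up the user's settings (LANG,
LC_ALL, ...) once it calls setlocale().  The helper name below is made
up for illustration, and nl_langinfo() is POSIX rather than ISO C, so
treat this as an assumption about your platform:

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the user's locale and report which
   character encoding the multibyte functions will now use. */
const char *init_locale_codeset(void)
{
    /* "" means "take the locale from the environment".  If this fails
       (returns NULL), the program simply stays in the "C" locale. */
    setlocale(LC_ALL, "");

    /* Typically "UTF-8" or "ISO-8859-1"; in the "C" locale glibc
       reports an ASCII name. */
    return nl_langinfo(CODESET);
}
```

Printing the returned name on both the German and the English
installation should show whether the two are really running with
different encodings.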

> After reading about the wide character functions I realized that they
> are locale dependent. But my program is not. The Unicode strings in my
> program, may contain any Unicode characters, no matter what the locale
> is.

From http://www.cl.cam.ac.uk/~mgk25/unicode.html:

"C support for Unicode and UTF-8

Starting with GNU glibc 2.2, the type wchar_t is officially intended to
be used only for 32-bit ISO 10646 values, independent of the currently
used locale. This is signalled to applications by the definition of the
__STDC_ISO_10646__ macro as required by ISO C99. The ISO C multi-byte
conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully
implemented in glibc 2.2 or higher and can be used to convert between
wchar_t and any locale-dependent multibyte encoding, including UTF-8,
ISO 8859-1, etc."
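
A minimal sketch of the conversion that passage describes, assuming
wchar_t holds ISO 10646 values and that the program has already called
setlocale(LC_ALL, "").  The function name wide_to_locale is made up for
illustration; wcsrtombs() is the standard call:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to the multibyte encoding of the current
   locale.  Returns a malloc'd string the caller must free, or NULL if
   some character cannot be represented in that encoding (or on
   allocation failure). */
char *wide_to_locale(const wchar_t *ws)
{
    mbstate_t state;
    const wchar_t *src = ws;
    size_t len;
    char *out;

    /* First pass: a NULL destination only measures the result. */
    memset(&state, 0, sizeof state);
    len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;            /* unrepresentable character (EILSEQ) */

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: do the actual conversion, including the final NUL. */
    src = ws;
    memset(&state, 0, sizeof state);
    wcsrtombs(out, &src, len + 1, &state);
    return out;
}
```

Without a setlocale() call the program stays in the "C" locale, where
this is expected to fail for most non-ASCII characters.  In a UTF-8
locale the same call should succeed for any valid code point.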

> Does anybody have any suggestions as to what format I should use when
> communicating with external libraries like the Gdk libraries, or even
> the stdlib and its string functions? It seems to me that wide character
> would be the right solution, but on the English installation these calls
> crash as soon as the (32-bit) character code is larger than 255.
> 
> Is there a standard way to map arbitrary Unicode characters to wide
> character without taking the locale into account?

First, though, I'd find out whether the char * versions of the GTK calls
honor the locale.  If they don't, complain to the GTK developers loudly;
they should.  If you really, really need to use the wchar versions,
you're probably better off requiring __STDC_ISO_10646__.  (That's
probably reasonable for GTK apps, but I'm no advocate of supporting
obsolete compilers, so you'd do well to get other opinions.)
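
One way "requiring __STDC_ISO_10646__" might look in code, as a sketch
only.  The function name is hypothetical; the macro is the one
described in the quoted text above.  Where it is defined, a Unicode
code point can be stored in a wchar_t directly, with no locale
involved:

```c
#include <wchar.h>

/* Hypothetical helper: store a Unicode code point in a wchar_t.
   Returns 0 on success, -1 if the platform makes no ISO 10646
   guarantee for wchar_t or the value does not fit. */
int codepoint_to_wchar(unsigned long cp, wchar_t *out)
{
#ifdef __STDC_ISO_10646__
    /* wchar_t values are ISO 10646 code points in every locale. */
    if (cp > 0x10FFFFUL || cp > (unsigned long)WCHAR_MAX)
        return -1;
    *out = (wchar_t)cp;
    return 0;
#else
    /* No guarantee: wchar_t may be locale-dependent or too narrow. */
    (void)cp;
    (void)out;
    return -1;
#endif
}
```

A stricter program could use #error in the #else branch and refuse to
build at all on platforms without the guarantee.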

If the GTK functions do honor the locale (and, since they probably use C
functions, they probably do to some degree), you're much better off
using them.  Debugging wchar-based programs is a real pain.

As to GTK crashing, that'd be a bug (whether it honors the locale
explicitly or not), so I'd report it.  (If you're setting the locale
and it's not expecting you to, that could cause this.)

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
