Michael B. Allen writes:

> But it's not clear to me how this should be done correctly
> and in a portable way (or at least portable enough so that when it comes
> time to port I don't smack myself in the forehead).

Use iconv: on GNU libc systems, the iconv built into libc; on other
systems, GNU libiconv (also from GNU, but a separate implementation).
libiconv has been ported to most systems.
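
A minimal sketch of the usual calling pattern (the encoding names here
are just examples, and you may need -liconv at link time when using
libiconv):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "caf\xe9";               /* "café" in ISO-8859-1 */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* tocode, fromcode */
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        fwrite(out, 1, sizeof(out) - outleft, stdout);   /* converted bytes */
        putchar('\n');
        return 0;
    }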

> I gather that I can only assume that wchar_t is just a sequence of UCS
> codes of sizeof(wchar_t) in size.

You cannot even assume that. wchar_t is locale dependent and
OS/compiler/vendor dependent; it should never be used in binary file
formats or network messages.
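
For instance, sizeof(wchar_t) is typically 4 on glibc systems but 2 on
Windows, so even this trivial program answers differently per platform:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Typically prints 4 on glibc, 2 on Windows; nothing here is
           guaranteed to match what another system wrote to disk. */
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
        return 0;
    }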

> But is the in memory representation
> of a multi-byte string the equivalent of the UTF-8 encoding

That depends on where you got the string. In most cases, e.g. when you
read it from stdin with fgets(), it will be in the locale-dependent
encoding (determined by the LC_CTYPE setting, usually through
environment variables). Only in particular cases, such as filenames
read from 'pax' archives, or when you yourself converted the string to
UTF-8, or when you got it from a GNOME 2 API function, will it be in
UTF-8.
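
If you need UTF-8 regardless of the user's locale, ask
nl_langinfo(CODESET) for the locale's encoding and convert explicitly.
A rough sketch (error handling abbreviated):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256], utf8[1024];
        char *inp = line, *outp = utf8;
        size_t inleft, outleft = sizeof(utf8);

        setlocale(LC_CTYPE, "");          /* pick up the user's locale */
        if (!fgets(line, sizeof(line), stdin)) return 1;
        inleft = strlen(line);

        /* Convert from whatever the locale uses to UTF-8. */
        iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        fwrite(utf8, 1, sizeof(utf8) - outleft, stdout);
        return 0;
    }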

> So as an example case, to encode wchar_t to UTF-16LE I must convert each
> character to a definitive encoding such as UCS-4 and then use iconv
> to get to UTF-16LE.

With the two aforementioned iconv implementations, you can also
directly use iconv_open("UTF-16LE", "wchar_t").
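
A sketch of that direct route (note the "wchar_t" encoding name is an
extension of these two implementations, not something POSIX promises,
and with libiconv it follows the current locale, hence the setlocale()
call):

    #include <iconv.h>
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        wchar_t ws[] = L"h\u00e9llo";        /* "héllo" */
        char out[64];
        char *inp = (char *)ws, *outp = out;
        size_t inleft = wcslen(ws) * sizeof(wchar_t), outleft = sizeof(out);

        setlocale(LC_CTYPE, "");      /* "wchar_t" is locale dependent */

        iconv_t cd = iconv_open("UTF-16LE", "wchar_t");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        printf("%zu bytes of UTF-16LE\n", sizeof(out) - outleft);
        return 0;
    }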

> PS: When encoding ASCII do I want to shave off the 8th bit?

Removing the 8th bit is a "garbage in, garbage out" technique that
causes endless grief for users. Instead call iconv_open("ASCII", ...),
and you'll get full error checking whenever a non-ASCII character is
encountered.
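
A sketch of what that buys you, assuming UTF-8 input (any fromcode
works the same way): with the two GNU implementations, iconv() fails
with errno set to EILSEQ right at the offending character instead of
silently mangling it:

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "na\xc3\xafve";          /* "naïve" in UTF-8 */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        iconv_t cd = iconv_open("ASCII", "UTF-8");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno == EILSEQ)
            fprintf(stderr, "non-ASCII character at byte %td\n", inp - in);
        iconv_close(cd);
        return 0;
    }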

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
