Roger Leigh wrote:
A while back, I made the useful discovery that GCC accepts UTF-8
encoded C source by default, and in the generated object code uses
UTF-8 for narrow (char) strings, and UTF-32/UCS-4 for wide (wchar_t)
strings.
As an example:
    #include <locale.h>
    #include <stdio.h>

    int
    main (void)
    {
      setlocale (LC_ALL, "");
      printf ("‘Name’\n");
      return 0;
    }
This then correctly outputs the quotes:
$ ./test
‘Name’
A better example is here:
http://groups-beta.google.com/group/comp.lang.c.moderated/msg/bb55bb9f835eba6a?hl=en
In this case, you can output wide strings to narrow streams, and
narrow strings to wide streams. In order to be able to do this, I
assume that the C runtime must know something of the execution
charsets in order to do the conversion, otherwise you wouldn't get
readable output. Additionally, when you output a wide string with
wprintf(), it must presumably be recoded to the narrow representation
for output.
The above link is wrong. I thought that given the C runtime's
knowledge of the execution charsets, it would recode the output into
the locale charset. This does not appear to be the case, however.
The above program behaves the same in the C locale as it does in a
normal UTF-8 locale.
Can anyone confirm whether the above is correct, or point to anywhere
this is documented?
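One way to probe the question above is to ask the C library directly how it would convert a wide string under a given locale. The following is only a sketch: narrow_length_in_locale() is a hypothetical helper name, not a library function, and it assumes a C99 library that provides wcsrtombs(). As far as I know, narrow strings passed to printf() are written out byte for byte, which would explain why the first program prints the same thing in the C locale; the wide-to-narrow conversion is the part that consults LC_CTYPE.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes the wide string `ws` would need
   once converted to the multibyte encoding of the locale `name`.
   Returns (size_t) -1 if the locale is not available or if some
   character cannot be represented in it.
   Note: this changes the process-wide LC_CTYPE setting. */
size_t
narrow_length_in_locale (const char *name, const wchar_t *ws)
{
  mbstate_t state;
  const wchar_t *src = ws;

  assert (name != NULL && ws != NULL);

  if (setlocale (LC_CTYPE, name) == NULL)
    return (size_t) -1;

  /* With a null destination, wcsrtombs() only counts the bytes. */
  memset (&state, 0, sizeof state);
  return wcsrtombs (NULL, &src, 0, &state);
}
```

Calling it with "C" and a string containing U+2018 should report a failure, whereas calling it with the name of an installed UTF-8 locale should report a length. Which UTF-8 locale names are installed varies from system to system, so that part has to be tried locally.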
Googling for "gcc utf-8" brings up a discussion from this list (Dec
2004) which references the GCC documentation.
The archive of that discussion starts at
http://mail.nl.linux.org/linux-utf8/2004-11/index.html#00008
GCC documentation is available online at
http://gcc.gnu.org/onlinedocs/
The behaviour of the compiler regarding Unicode strings can be
controlled with preprocessor options such as -finput-charset,
-fexec-charset and -fwide-exec-charset.
The page for this is
http://gcc.gnu.org/onlinedocs/gcc-4.0.0/gcc/Preprocessor-Options.html#Preprocessor-Options
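To check which encoding actually ended up in the object code (for instance after changing -fexec-charset or -fwide-exec-charset), it can help to dump the bytes of a literal. This is a minimal sketch and not taken from the GCC documentation: hex_bytes(), narrow_quote and wide_quote are names made up for illustration, and the source file is assumed to be saved as UTF-8 and compiled with the default settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* The left single quote (U+2018) from the example program, typed
   directly into the UTF-8 encoded source file. */
const char    narrow_quote[] = "‘";
const wchar_t wide_quote[]   = L"‘";

/* Hypothetical helper: write the bytes of the narrow string `s` into
   `out` as space-separated uppercase hex digits. */
void
hex_bytes (const char *s, char *out, size_t outsize)
{
  size_t n = strlen (s);
  size_t used = 0;

  assert (outsize >= 3 * n + 1);   /* at most 3 chars per byte, plus NUL */
  out[0] = '\0';
  for (size_t i = 0; i < n; i++)
    used += (size_t) snprintf (out + used, outsize - used, "%s%02X",
                               i ? " " : "",
                               (unsigned int) (unsigned char) s[i]);
}
```

With the defaults, hex_bytes (narrow_quote, buf, sizeof buf) should leave "E2 80 98" in buf, which is the UTF-8 encoding of U+2018, and wide_quote[0] should be 0x2018. Different bytes after recompiling with another execution charset would indicate that the option took effect.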
Hope this helps,
Simos
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/