Michael,
External representation - UTF-8
If you want an encoding to send to a different machine, the best choice
is UTF-8. Most code pages limit your character set, so no single code page
is a complete solution, and the other forms of Unicode have big-endian/
little-endian problems. On an RS6000 running AIX, an 'A' in UTF-16 is
\x00\x41 and in UTF-32 it is \x00\x00\x00\x41. On an Intel system the
UTF-16 is \x41\x00 and the UTF-32 is \x41\x00\x00\x00. If you use either of
these encodings you may have to flip the bytes in the string to make the
data usable. With UTF-8 the bytes are the same on all systems.
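To make the byte order point concrete, here is a minimal sketch in C. The function names (utf16_native_bytes, utf16_be_bytes, host_is_little_endian) are made up for this illustration and do not come from any library. It shows that a 16-bit code unit copied straight out of memory depends on the host, while a serializer that picks the order explicitly does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a 16-bit code unit into out[] exactly as
 * this machine stores it in memory (host byte order). */
void utf16_native_bytes(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    memcpy(out, &unit, sizeof unit);
}

/* Hypothetical helper: write the same code unit in big-endian order,
 * whatever the host byte order is. */
void utf16_be_bytes(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);   /* high byte first */
    out[1] = (unsigned char)(unit & 0xFF); /* low byte second */
}

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

For 'A' (0x0041), utf16_native_bytes gives 41 00 on a little-endian machine and 00 41 on a big-endian one, while utf16_be_bytes gives 00 41 everywhere. The UTF-8 form of 'A' is the single byte 41, so there is nothing to flip.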
Internal representation - Depends on the system
If you use the native OS functions then the Unicode support depends on the
OS. Most, but not all, systems have either a 4-byte or a 2-byte wchar_t and
functions to convert UTF-8 to wchar_t. You then have to use the wide
character variants of the string handling functions. Many systems also
support UTF-8 as an MBCS (Multi-Byte Character Set). This works because most
functions that support MBCS data will handle UTF-8 data once you swap in the
character length calculation routine, which is the part that changes with
each character set (encoding).
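As a rough illustration of what such a character length routine looks like for UTF-8, here is a small sketch in C. The names utf8_seq_len and utf8_strlen are hypothetical and are not taken from any OS or library. It only checks lead and continuation byte patterns; it does not reject overlong forms or surrogate code points, which a production routine should.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with this lead byte, or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII        */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx               */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx               */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx               */
    return 0;                            /* continuation or invalid */
}

/* Hypothetical helper: count characters (code points) in a
 * NUL-terminated UTF-8 string. Returns (size_t)-1 if the string is
 * malformed or truncated. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t n = utf8_seq_len((unsigned char)*s);
        size_t i;

        if (n == 0) return (size_t)-1;
        /* Every following byte must look like 10xxxxxx. A NUL fails
         * this test, so we never read past the end of the string. */
        for (i = 1; i < n; i++) {
            if (((unsigned char)s[i] & 0xC0) != 0x80) return (size_t)-1;
        }
        s += n;
        count++;
    }
    return count;
}
```

With this, utf8_strlen("A") is 1 and a string holding the three bytes E2 82 AC (the euro sign) is also 1, whereas the plain strlen would report 3 for the latter.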
The other way to go is to use some add-on code. I like ICU because it works
on everything from AS400 to Mac. The problem, however, is that it only
supports UTF-16 and it has a more Java-like API. That is not surprising
because it is essentially the Java Unicode support, greatly enhanced and
made available for C/C++. This is why I have added extra code, xIUA
http://www.xnetinc.com/xiua/, which gives you familiar APIs that look
more like the ones that you see in the C library: strstr is xiua_strstr and
strlen is xiua_strlen. This code is much more Linux friendly because it
supports both UTF-8 and UTF-32, which is what Linux uses for wchar_t.
It even lets you mix and match encodings in the same application. For
example, you can have a browser that is using Shift_JIS, HTML pages in
EUC-JP, a database using UTF-16, and a data server that you communicate
with in UTF-8. The next request can arrive in a completely different
set of encodings and the same code can handle it.
This is all free open source code so that you can tailor it to your needs.
Carl
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Michael B. Allen
> Sent: Tuesday, September 11, 2001 12:26 AM
> To: Bruno Haible; [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Encoding <--> Internal Representation [was inaccurately Re:
> Encoding conversions]
>
>
> > Also, I would forget about wchar_t. Nobody uses that. 'char*' is
> > better than 'wchar_t*': both are locale dependent, but the 'char*'
> > strings can be more easily communicated to stdout.
>
> Bruno,
>
> Hold on. Reset please. I'm NOT trying to normalize to a specific
> _encoding_. I just want to normalize on one particular string
> _representation_ that has all the string manipulation routines to go with
> it (e.g. strstr, strlen, printf). If I serialize a string as encoding X
> and send it to a different machine, as long as I read it out as encoding
> X it doesn't matter how that string is represented in memory. All I
> need to know is that it is. I will let iconv take care of figuring out
> how to convert whatever the serialized encoding was to the internal
> representation. I am not going to serialize strings as wchar_t. If
> wchar_t characters are UCS with or without the __STDC_ISO_10646__ macro
> on one machine and rot13 mixed with locales and OS dependencies on another
> I don't see how that has anything to do with serialization functions.
>
> I don't understand why people are telling me to use particular encodings
> over another and look for the stdc macro, etc. Please help. I'm hopelessly
> confused to the point of just giving up on this whole project and finding
> another project to work on.
>
> Mike
>
> --
> Wow a memory-mapped fork bomb! Now what on earth did you expect? - lkml
> -
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/