RE: Encoding <--> Internal Representation [was inaccurately Re: Encoding conversions]

Carl W. Brown Tue, 11 Sep 2001 06:59:54 -0700
Michael,

External representation - UTF-8

If you want an encoding to send to a different machine then the best choice
is UTF-8.  Most code pages limit your character set, so no single code page
is a general solution, and the other forms of Unicode have big endian/little
endian problems.  On an RS6000 running AIX an 'A' in UTF-16 is \x00\x41 and
in UTF-32 it is \x00\x00\x00\x41.  On an Intel system the UTF-16 is \x41\x00
and UTF-32 is \x41\x00\x00\x00.  If you use either of these encodings then
you may have to flip the bytes in the string to make the data usable.  With
UTF-8 the byte sequence is the same on all systems.
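
To make the byte order point concrete, here is a minimal sketch in C.  The
helper names (put_utf16_be, put_utf16_le, put_utf8) are made up for this
illustration and do not come from any library.  The two UTF-16 helpers write
the same code unit in the two byte orders described above, while the UTF-8
helper has only one possible output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a UTF-16 code unit high-order byte first,
 * which is how a big endian machine such as an RS6000 lays it out. */
static void put_utf16_be(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Hypothetical helper: store the same code unit low-order byte first,
 * which is how an Intel machine lays it out. */
static void put_utf16_le(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if the value is a
 * surrogate or lies beyond U+10FFFF.  No byte order is involved. */
static size_t put_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A receiver of UTF-16 has to know which of the first two layouts the sender
used.  A receiver of UTF-8 does not have to ask.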

Internal representation - Depends on the system

If you use the native OS functions then the Unicode support depends on the
OS.  Most but not all systems have either a 4 byte or a 2 byte wchar_t and
functions to convert UTF-8 to wchar_t.  With wchar_t you have to use the wide
character variants of the string handling functions.  Many systems also
support UTF-8 as an MBCS (Multi-Byte Character Set).  This is because most
functions that support MBCS data will work with UTF-8 data once you swap in
the character length calculation routine, which is the part that differs
for each character set (encoding).
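
As a rough sketch of what such a character length routine can look like for
UTF-8, the C below derives the length from the lead byte and uses it to count
characters.  The names utf8_char_len and utf8_count_chars are hypothetical,
and the sketch assumes the current 4-byte limit on UTF-8 sequences.  It does
not validate continuation bytes, so treat it as an illustration rather than a
complete decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine: number of bytes in the UTF-8 character that starts
 * with this lead byte, or 0 if the byte cannot start a character. */
static size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* ASCII              */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* U+0080..U+07FF     */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* U+0800..U+FFFF     */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* U+10000..U+10FFFF  */
    return 0;                                    /* continuation/invalid */
}

/* Hypothetical routine: count characters (not bytes) in a NUL-terminated
 * UTF-8 string.  A stray byte is counted as one character so that the loop
 * always makes progress. */
static size_t utf8_count_chars(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0') {
        size_t len = utf8_char_len((unsigned char)*s);
        if (len == 0)
            len = 1;
        /* Stop at the terminator even if the last character is truncated. */
        while (len > 0 && *s != '\0') {
            s++;
            len--;
        }
        count++;
    }
    return count;
}
```

For the 6-byte string "h\xC3\xA9llo" strlen reports 6 while
utf8_count_chars reports 5, because the two bytes \xC3\xA9 form one
character.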

The other way to go is to use some add-on code.  I like ICU because it works
on everything from AS400 to Mac.  The problem however is that it only
supports UTF-16 and it has a more Java-like API.  That is not surprising
because it is essentially the Java Unicode support that has been greatly
enhanced and made available for C/C++.  This is why I have added extra code,
xIUA http://www.xnetinc.com/xiua/, which gives you familiar APIs that look
more like the ones that you see in the C library.  strstr is xiua_strstr and
strlen is xiua_strlen.  This code is much more Linux friendly because it
supports both UTF-8 and UTF-32, which is what Linux uses for wchar_t.

It even lets you mix and match encodings in the same application.  For
example you can have a browser that is using Shift_JIS, HTML pages in
EUC-JP, a database using UTF-16 and you can be communicating with a data
server using UTF-8.  The next request can come in with a completely
different set of encodings and the same code can handle it.
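
To illustrate the general idea of choosing encodings at run time, here is a
small sketch that uses the POSIX iconv interface you mentioned, not the xIUA
API.  The function convert_encoding is a hypothetical wrapper written for
this example.  The encoding names are passed in as strings, so the same code
can serve a request with a different pair of encodings each time, assuming
the local iconv supports them.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper around iconv: convert in_len bytes from encoding
 * `from` to encoding `to`, writing at most out_cap bytes into out.
 * Returns the number of bytes written, or (size_t)-1 on any failure
 * (unknown encoding name, invalid input, or output buffer too small). */
static size_t convert_encoding(const char *from, const char *to,
                               const char *in, size_t in_len,
                               char *out, size_t out_cap)
{
    assert(from != NULL && to != NULL && in != NULL && out != NULL);

    iconv_t cd = iconv_open(to, from);   /* note the order: to, then from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;           /* iconv takes a non-const pointer */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);  /* flush state */

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

A call such as convert_encoding("UTF-16BE", "UTF-8", ...) would cover the
database-to-server leg in the example above, and the Shift_JIS and EUC-JP
legs only change the two name strings.  On some systems iconv lives in a
separate library and needs -liconv at link time.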

This is all free open source code so that you can tailor it to your needs.

Carl



> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Michael B. Allen
> Sent: Tuesday, September 11, 2001 12:26 AM
> To: Bruno Haible; [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Encoding <--> Internal Representation [was inaccurately Re:
> Encoding conversions]
>
>
> > Also, I would forget about wchar_t. Nobody uses that. 'char*' is
> > better than 'wchar_t*': both are locale dependent, but the 'char*'
> > strings can be more easily communicated to stdout.
>
> Bruno,
>
> Hold on. Reset please. I'm NOT trying to normalize to a specific
> _encoding_. I just want to normalize on one particular string
> _representation_ that has all the string manipulation routines to go with
> it (e.g. strstr, strlen, printf).  If I serialize a string as encoding X
> and send it to a different machine, as long as I read it out as encoding
> X it doesn't matter how that string is represented in memory. All I
> need to know is that it is. I will let iconv take care of figuring out
> how to convert whatever the serialized encoding was to the internal
> representation.  I am not going to serialize strings as wchar_t. If
> wchar_t characters are UCS with or without the __STDC_ISO_10646__ macro
> on one machine and rot13 mixed with locales and OS dependencies on another
> I don't see how that has anything to do with serialization functions.
>
> I don't understand why people are telling me to use particular encodings
> over another and look for the stdc macro, etc. Please help. I'm hopelessly
> confused to the point of just giving up on this whole project and finding
> another project to work on.
>
> Mike
>
> --
> Wow a memory-mapped fork bomb! Now what on earth did you expect? - lkml
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/