Re: Encoding <--> Internal Representation [was inaccurately Re: Encoding conversions]

Michael B. Allen Tue, 11 Sep 2001 07:27:49 -0700
I'm sorry. I have a larger problem right now.

Thanks for all your help guys.

Mike

On Tue, Sep 11, 2001 at 07:27:47AM -0700, Carl W. Brown wrote:
> Michael,
> 
> External representation - UTF-8
> 
> If you want an encoding to send to a different machine then the best choice
> is UTF-8.  Most code pages limit your character set so there is no single
> solution, and the other forms of Unicode have big endian/little endian
> problems.  On an RS6000 running AIX an 'A' in UTF-16 is \x00\x41 and in
> UTF-32 it is \x00\x00\x00\x41.  On an Intel system the UTF-16 is \x41\x00
> and UTF-32 is \x41\x00\x00\x00.  If you use either of these encodings then
> you may have to flip the bytes in the string to make the data usable.  With
> UTF-8 it is the same on all systems.
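
To make the byte-order point above concrete, here is a minimal sketch in C
that writes one code point in each form. The function names (put_utf16,
put_utf32, put_utf8) are made up for this illustration and do not come from
ICU, xIUA or the C library. put_utf16 only handles a single 16-bit code unit,
so characters outside the BMP would need surrogate pairs that are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one UTF-16 code unit in the requested byte order. */
size_t put_utf16(uint16_t unit, int big_endian, unsigned char *out)
{
    assert(out != NULL);
    if (big_endian) {
        out[0] = (unsigned char)(unit >> 8);
        out[1] = (unsigned char)(unit & 0xFF);
    } else {
        out[0] = (unsigned char)(unit & 0xFF);
        out[1] = (unsigned char)(unit >> 8);
    }
    return 2;
}

/* Hypothetical helper: write one UTF-32 code point in the requested byte order. */
size_t put_utf32(uint32_t cp, int big_endian, unsigned char *out)
{
    int i;
    assert(out != NULL);
    for (i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        out[i] = (unsigned char)((cp >> shift) & 0xFF);
    }
    return 4;
}

/* Hypothetical helper: write one code point as UTF-8.  The byte sequence
 * is defined by the encoding itself, so there is no byte-order parameter.
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t put_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling put_utf16(0x41, 1, buf) fills buf with \x00\x41 and put_utf16(0x41,
0, buf) fills it with \x41\x00, matching the AIX and Intel cases described
above, while put_utf8(0x41, buf) always yields the single byte \x41.
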
> 
> Internal representation - Depends on the system
> 
> If you use the native OS functions then the Unicode support depends on the
> OS.  Most but not all systems have either a 4 byte or a 2 byte wchar_t and
> functions to convert UTF-8 to wchar_t.  You have to use the wide character
> variants of the string handling functions.  Many systems also support UTF-8
> as a MBCS (Multi-Byte Character Set).  This is because most functions that
> support MBCS data will work with UTF-8 data if you just swap in the
> character length calculation routine, which is different for each character
> set (encoding).
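
As a rough sketch of what such a character length calculation routine could
look like for UTF-8, assuming NUL-terminated input: utf8_char_len and
utf8_strlen are names invented for this example, not functions from any
library mentioned in this thread. The check is deliberately shallow. It
validates lead bytes and continuation bytes but does not reject every
overlong or out-of-range sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with lead byte c, or 0 if c cannot start a sequence. */
size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80)
        return 1;                 /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        return 2;
    if (c >= 0xE0 && c <= 0xEF)
        return 3;
    if (c >= 0xF0 && c <= 0xF4)
        return 4;
    return 0;                     /* continuation byte or invalid lead byte */
}

/* Hypothetical helper: count characters (not bytes) in a NUL-terminated
 * UTF-8 string.  Returns (size_t)-1 if the string is malformed. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t len = utf8_char_len((unsigned char)*s);
        size_t i;

        if (len == 0)
            return (size_t)-1;
        /* Every trailing byte must look like 10xxxxxx.  A premature NUL
         * fails this test too, so we never read past the terminator. */
        for (i = 1; i < len; i++) {
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        }
        s += len;
        count++;
    }
    return count;
}
```

An MBCS-aware string routine that steps through its input one character at a
time could use utf8_char_len in place of the length routine for another
encoding and leave the rest of its logic unchanged.
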
> 
> The other way to go is to use some add-on code.  I like ICU because it works
> on everything from AS400 to Mac.  The problem however is that it only
> supports UTF-16 and it has a more Java like API.  That is not surprising
> because it is essentially the Java Unicode support that has been greatly
> enhanced and made available for C/C++.  This is why I have added extra code,
> xIUA http://www.xnetinc.com/xiua/ which gives you familiar APIs that look
> more like the ones that you see in the C library.  strstr is xiua_strstr and
> strlen is xiua_strlen.  This code is much more Linux friendly because it
> supports both UTF-8 and UTF-32, which is what Linux uses for wchar_t.
> 
> It even lets you mix and match encodings in the same application.  For
> example you can have a browser that is using Shift_JIS, HTML pages in
> EUC-JP, a data base using UTF-16 and you can be communicating with a data
> server using UTF-8.  The next request can come in in a completely different
> set of encodings and the same code can handle it.
> 
> This is all free open source code so that you can tailor it to your needs.
> 
> Carl
> 
> 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED]]On Behalf Of Michael B. Allen
> > Sent: Tuesday, September 11, 2001 12:26 AM
> > To: Bruno Haible; [EMAIL PROTECTED]
> > Cc: [EMAIL PROTECTED]
> > Subject: Encoding <--> Internal Representation [was inaccurately Re:
> > Encoding conversions]
> >
> >
> > > Also, I would forget about wchar_t. Nobody uses that. 'char*' is
> > > better than 'wchar_t*': both are locale dependent, but the 'char*'
> > > strings can be more easily communicated to stdout.
> >
> > Bruno,
> >
> > Hold on. Reset please. I'm NOT trying to normalize to a specific
> > _encoding_. I just want to normalize on one particular string
> > _representation_ that has all the string manipulation routines to go with
> > it (e.g. strstr, strlen, printf).  If I serialize a string as encoding X
> > and send it to a different machine, as long as I read it out as encoding
> > X it doesn't matter how that string is represented in memory. All I
> > need to know is that it is. I will let iconv take care of figuring out
> > how to convert whatever the serialized encoding was to the internal
> > representation.  I am not going to serialize strings as wchar_t. If
> > wchar_t characters are UCS with or without the __STDC_ISO_10646__ macro
> > on one machine and rot13 mixed with locales and OS dependencies on another
> > I don't see how that has anything to do with serialization functions.
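
One possible shape for the "let iconv convert the serialized encoding to the
internal representation" step, sketched in C. decode_to_wchar is a
hypothetical helper written for this note. It assumes an iconv
implementation that accepts "WCHAR_T" as a target name (glibc and GNU
libiconv do) and that takes char ** for its input pointer. The test values
in the usage note further assume wchar_t holds UCS code points, which is
what __STDC_ISO_10646__ advertises.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert inlen bytes at in, serialized as from_enc,
 * into the platform's wchar_t representation.  out has room for outcap
 * wide characters including the terminator.  Returns the number of wide
 * characters written (terminator not counted), or (size_t)-1 on error. */
size_t decode_to_wchar(const char *from_enc, const char *in, size_t inlen,
                       wchar_t *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv does not modify the input bytes */
    char *outp = (char *)out;
    size_t inleft = inlen;
    size_t outleft;
    size_t rc;
    size_t written;

    assert(from_enc != NULL && in != NULL && out != NULL);
    if (outcap == 0)
        return (size_t)-1;
    outleft = (outcap - 1) * sizeof(wchar_t);   /* keep room for L'\0' */

    cd = iconv_open("WCHAR_T", from_enc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;       /* conversion not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;       /* invalid or incomplete input, or out full */

    written = (outcap - 1) - outleft / sizeof(wchar_t);
    out[written] = L'\0';
    return written;
}
```

With this in place the caller only names the wire encoding, for example
decode_to_wchar("UTF-8", buf, len, wbuf, cap), and never needs to know what
wchar_t looks like in memory. The reverse direction for serialization would
open the descriptor with the arguments swapped.
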
> >
> > I don't understand why people are telling me to use particular encodings
> > over another and look for the stdc macro, etc. Please help. I'm hopelessly
> > confused to the point of just giving up on this whole project and finding
> > another project to work on.
> >
> > Mike
> >
> > --
> > Wow a memory-mapped fork bomb! Now what on earth did you expect? - lkml
> > -
> > Linux-UTF8:   i18n of Linux on all levels
> > Archive:      http://mail.nl.linux.org/linux-utf8/
> 
