On Sun, Sep 09, 2001 at 05:31:27PM +0200, Bruno Haible wrote:
> > > You cannot even assume that. wchar_t is locale dependent and
> > > OS/compiler/vendor dependent. It should never be used for "binary file
> > > formats and network messages".
> >
> > Well, I have to normalize to something!
>
> wchar_t is a very wrong thing to normalize to, because it is OS and
> locale dependent. UTF-8 is a much better normalization for strings,
> both in-memory and on disk. UCS-4 is an alternative, good
> normalization for strings in memory.
Well, then what's it good for?
Maybe we misunderstand each other. Perhaps if I tell you exactly what
I'm trying to do you can just tell me how I should do it?
I want to encode and decode binary data from sockets and files
(streams). Because serializing and deserializing integers and strings is
fundamental to these problems, I have written a very lightweight piece
of code designed specifically to abstract this functionality. I have
placed the work at the URL below if you care to examine it:
http://auditorymodels.org/encdec/
I would like the code to be as general as possible. For one project
(an SMB/CIFS server) I will be decoding and encoding many UCS-2LE (or
UTF-16LE, not sure) strings. Another interest of mine is the MS Word
binary file format, which has a slew of different string types potentially
mixed into the same document.
I thought that normalizing strings to wchar_t was the wise choice because
I could take advantage of the existing string manipulation functions
like wcslen, wcsstr, etc. (Actually, I believe someone on this list
instructed me to use wchar_t regarding a similar question.) But now I
should use UTF-8?
In light of the detail above, can you tell me what the ideal solution
to this problem would be?
Mike
--
Wow a memory-mapped fork bomb! Now what on earth did you expect? - lkml
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/