Hi Terry and all,
I usually just lurk on the list, but since I'm a C++ aficionado, I wanted to question the statement of yours quoted below.
If we settle on wchar_t being 16 bits, we will still be forced to use a variable-length encoding (UTF-7/8/16) to properly handle an arbitrary Unicode (or ISO/IEC 10646) string, since we must deal with that charming thing known as "surrogate pairs" (see section 3.7 of the Unicode standard v3.0). This again breaks the "one wchar_t == one character" assumption. When forced to deal with Unicode, I much prefer working with 32 bits, since that guarantees a fixed length for each character. Admittedly, it is space inefficient to the Nth degree, but speedwise it is better.
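As a rough sketch of what surrogate handling costs with 16-bit units, the function below decodes UTF-16 code units into 32-bit code points. The name decode_utf16 and the choice to replace unpaired surrogates with U+FFFD are assumptions made for illustration; they are not taken from any patch discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into Unicode code points.
// Unpaired surrogates are replaced with U+FFFD (an illustrative policy choice).
std::vector<std::uint32_t> decode_utf16(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const std::uint32_t u = units[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        const bool next_is_low = (i + 1 < units.size())
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            // Combine a high/low surrogate pair into one code point.
            const std::uint32_t low = units[i + 1];
            const std::uint32_t cp = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
            assert(cp <= 0x10FFFF);
            out.push_back(cp);
            ++i;  // the low surrogate has been consumed as well
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // lone surrogate
        } else {
            out.push_back(u);  // code unit is already a complete code point
        }
    }
    return out;
}
```

For example, the three code units 0x0041 0xD834 0xDD1E decode to the two code points U+0041 and U+1D11E, so the number of units and the number of characters no longer agree.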
As for interoperability with Windows, it is clearly stated that wchar_t is intended for internal usage only, and that the various encoding schemes should be used when storing strings outside of a process. In reality this means that just about every Unicode-capable application reads and writes UTF-8 or UTF-7, so interoperability should not become an issue. If it really had been expected to be an issue, I'm sure the C++ standard would have mandated a specific width for wchar_t, which as far as I am aware it didn't. The draft copy I pulled out via Google says the following:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (_lib.locale_). Type wchar_t shall have the same size, signedness, and alignment requirements (_intro.memory_) as one of the other integral types, called its underlying type.
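Since the wording above leaves the width to the implementation, code can only ask the platform what it chose. A minimal sketch, assuming a C++11 or later compiler; the names wchar_bits, wchar_holds_any_code_point and units_for are hypothetical, and the mapping "16-bit wchar_t means UTF-16, otherwise UTF-32" is an assumption for illustration rather than something the standard guarantees:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cwchar>

// Width of wchar_t in bits on this implementation.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True if a single wchar_t can hold every code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point()
{
    return WCHAR_MAX >= 0x10FFFF;
}

// Number of wchar_t units needed for one code point, assuming wide strings
// hold UTF-16 when wchar_t is too narrow and UTF-32 otherwise.
inline std::size_t units_for(std::uint32_t code_point)
{
    assert(code_point <= 0x10FFFF);
    if (wchar_holds_any_code_point()) {
        return 1;
    }
    return code_point >= 0x10000 ? 2 : 1;
}
```

On a platform with a 32-bit wchar_t, units_for returns 1 for every code point; with a 16-bit wchar_t it returns 2 for anything above U+FFFF.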
So, in light of this, what would be the most appropriate choice? I haven't yet had a chance to explore which locales we support, but I would lean toward saying wchar_t == 32 bits, since this is future proof. If we are later forced to go from 16 -> 32 bits due to us supporting more of the Asian locales, I foresee this causing _major_ breakage.
If anyone actually has a copy of the C++ standard and would be kind enough to paste the section regarding the size of wchar_t, that would be most helpful for this discussion I believe.
Johny Mattsson | Email: [EMAIL PROTECTED]
Ericsson Support Engineer | Phone: +61 (0)3 9301 1372
NCSA NetScreen Certified | Mobile: +61 (0)404 003 713
From: Terry Lambert [SMTP:[EMAIL PROTECTED]]
Sent: Tuesday, June 18, 2002 9:47 PM
To: Thomas David Rivers
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: PATCH: wchar_t is already defined in libstd++
o A desire for raw storage of Unicode, rather than UTF-8 or
UTF-7 encoding. This last one is:
o UTF encoding breaks fixed field storage, which has
always been a measure of the number of characters
you can put in a field.
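
To illustrate the fixed-field point quoted above: with UTF-8 the byte count and the character count of a string diverge, so a field sized in bytes no longer says how many characters fit. A minimal sketch, assuming well-formed UTF-8 input; utf8_length is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}
```

The string "na\xC3\xAFve" (the word "naive" with a diaeresis on the i) occupies 6 bytes but contains 5 code points.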