Re: How do I use Xerces strings?

David Bertoni Wed, 08 Mar 2006 22:22:47 -0800

Steven T. Hatton wrote:

On Wednesday 08 March 2006 02:18, Scott Cantor wrote:

IIRC, there /are/ different UTF encodings, even within UTF-16.
There is something called UCS-4, and also something called UCS-2 (I
believe). I do not know the difference between these and their related
UTF-32 and UTF-16.

Nor I, but that's what I had in mind when I expressed caution.

To my mind, the failure to specify a UTF-16 string class is one of the worst aspects of C++.

That would require that C++ define some integral character type that is encoded in UTF-16. It's unlikely that every compiler vendor would agree to do that, although it would certainly make implementing software that supports Unicode much easier.

After reading the applicable sections of ISO/IEC 14882:2003, I have come to the conclusion that the Xerces XMLCh is not defined in such a way as to conform to the definition of a C++ implementation's extended character set.

XMLCh is defined to hold UTF-16 code units, which is a much stricter requirement than anything the C++ standard says about character sets.

In order to implement the C++ extended character set, members of the C++ basic character set (ASCII character set) should be defined as wchar_t using their wide character literals. That is, for example:
typedef wchar_t XMLCh;

const XMLCh chLatin_A               = L'A';
const XMLCh chLatin_B               = L'B';
const XMLCh chLatin_C               = L'C';
const XMLCh chLatin_D               = L'D';

Rather than:

typedef unsigned short XMLCh;

const XMLCh chLatin_A               = 0x41;
const XMLCh chLatin_B               = 0x42;
const XMLCh chLatin_C               = 0x43;
const XMLCh chLatin_D               = 0x44;

You are making the assumption that the basic character set must be encoded in ASCII, but the C++ standard makes no such requirement.

There may be reasons the Xerces developers chose to implement UTF-16 without conforming to the requirements for implementing the C++ extended character set. I guess, technically speaking, the encoding of UTF-16 and the extended character set will not, in general, coincide.

I'm not sure I understand what you're saying. Xerces-C encodes character data in UTF-16, and to do that, it uses a 16-bit integral type. It cannot use wchar_t to hold UTF-16 code units, because there is no guarantee that a particular C++ implementation will encode wchar_t in UTF-16. In fact, there is no requirement that wchar_t even be a 16-bit integral type.
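To make the point about code-unit width concrete, here is a minimal sketch, assuming a C++11 or later compiler. The names CodeUnit16, wcharBits and encodeUtf16 are hypothetical and are not part of Xerces-C. It shows a type that is pinned to 16 bits, a way to see how wide wchar_t happens to be on a given implementation, and why UTF-16 needs two code units for code points above U+FFFF.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 16-bit code-unit type.
// This is an illustration only, not the Xerces-C definition of XMLCh.
typedef std::uint16_t CodeUnit16;

static_assert(sizeof(CodeUnit16) * CHAR_BIT == 16,
              "CodeUnit16 must be exactly 16 bits wide");

// Width of wchar_t in bits. The value is implementation-defined,
// which is why wchar_t cannot be relied on to hold UTF-16 code units.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (at most 0x10FFFF and
// not itself in the surrogate range 0xD800..0xDFFF).
inline std::vector<CodeUnit16> encodeUtf16(std::uint32_t cp)
{
    std::vector<CodeUnit16> out;
    if (cp < 0x10000u) {
        // Code points in the Basic Multilingual Plane fit in one unit.
        out.push_back(static_cast<CodeUnit16>(cp));
    } else {
        // Everything else is split into a high and a low surrogate.
        cp -= 0x10000u;
        out.push_back(static_cast<CodeUnit16>(0xD800u + (cp >> 10)));
        out.push_back(static_cast<CodeUnit16>(0xDC00u + (cp & 0x3FFu)));
    }
    return out;
}
```

Calling wcharBits() on different compilers may well give different answers, while encodeUtf16 produces the same code units wherever a 16-bit unsigned type exists.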

That is, there is no requirement that the ASCII character set be
encoded using ASCII values. In such a case, then the numerical value
of chLatin_A would not be the same in all implementations.

Well, I would hope an ASCII character would be encoded in ASCII. ;-) Perhaps what you really meant was that there is no requirement that the basic character set be encoded in ASCII.
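As a rough illustration of the difference, consider the sketch below. The names kLatin_A and literalMatchesCodeUnit are hypothetical and are not taken from Xerces-C. A constant written as a number has the same value everywhere, whereas the value of a wide literal such as L'A' depends on the implementation's execution character set, so the two only agree when that character set happens to match Unicode for the character in question.

```cpp
// Hypothetical constant in the style of the numeric definitions quoted above.
// Its value is 0x41 on every implementation, by construction.
const unsigned short kLatin_A = 0x41;

// Compare the implementation's value for a wide character against a fixed
// UTF-16 code unit. The cast to unsigned long avoids surprises on
// implementations where wchar_t is a signed type.
inline bool literalMatchesCodeUnit(wchar_t literal, unsigned short unit)
{
    return static_cast<unsigned long>(literal) == unit;
}

// True only if this implementation encodes L'A' as 0x41.
// Nothing in the C++ standard guarantees that it does.
inline bool wideLatinAMatchesUtf16()
{
    return literalMatchesCodeUnit(L'A', kLatin_A);
}
```

On an implementation whose wide execution character set is not Unicode-based, wideLatinAMatchesUtf16() could return false, while kLatin_A would still be 0x41.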


Dave
