Steven T. Hatton wrote:
On Thursday 09 March 2006 12:16, David Bertoni wrote:
Steven T. Hatton wrote:

wchar_t is 32 bits on my system.  I believe that a 16 bit storage unit
will under normal circumstances occupy a 32 bit memory location, but only
use half of it.
Yes, and don't you think that's rather wasteful?  Would you use Xerces-C
to process large XML documents if you knew it was wasting half of its
character string storage just so it could use wchar_t on all platforms?

Actually, I did not state my intended meaning well, and I have now come to understand that I was in error. I was thinking in terms of individual units of storage, i.e., individual characters as opposed to containers. Containers (at least sequential containers) are basically arrays under the hood, so they do store data contiguously. I believe an individual 16-bit XMLCh will occupy 32-bits of storage, but that is probably a fairly rare animal, and therefore not worth consideration.

I guess I don't understand what you mean by "I believe an individual 16-bit XMLCh will occupy 32-bits of storage." How can a 16-bit XMLCh ever occupy 32 bits of storage?
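
To make the storage question concrete, here is a minimal sketch. It assumes a platform with 8-bit bytes where `std::uint16_t` exists; `Unit16`, `Padded` and `bytesForArray` are hypothetical names for illustration and are not part of Xerces-C. An array of 16-bit units is contiguous and costs two bytes per unit. Extra bytes appear only as padding when a lone unit sits next to a more strictly aligned member.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 16-bit code unit type (not the Xerces-C typedef).
typedef std::uint16_t Unit16;

// A lone 16-bit member next to a 32-bit member: compilers typically insert
// padding after `unit`, so the struct is usually 8 bytes rather than 6.
struct Padded {
    Unit16 unit;
    std::uint32_t other;
};

// Bytes occupied by n code units stored contiguously (array or vector buffer).
inline std::size_t bytesForArray(std::size_t n) {
    return n * sizeof(Unit16);
}
```

Any padding is a property of the enclosing object's layout, not of the 16-bit type itself.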

Why does Xerces-C use a non-standard data type?
unsigned short is not a non-standard type.  You may think it's
"non-standard" for holding character data, but Xerces-C encodes
character data in UTF-16 code units, and that requires a 16-bit integral
type.

It is (AFAIK) not one of the datatypes supported by my Standard Library implementation. That is my point. I cannot seamlessly use it with the facilities provided by the C++ Standard Library.

I agree it's a big problem that you cannot use it with std::basic_string, but there's no reason why you can't use it with the other containers. What other facilities do you want to use?
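
As a rough illustration of the point about the other containers, the sketch below stores 16-bit units in a `std::vector` and applies a standard algorithm to them. `Unit16`, `makeUnits` and `countUnit` are hypothetical names, and the widening in `makeUnits` assumes the input is plain ASCII, so that each `char` value equals its UTF-16 code unit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical alias for a 16-bit code unit; assumes unsigned short is 16 bits.
typedef unsigned short Unit16;

// Widen an ASCII C string into a vector of 16-bit units (ASCII-only assumption).
inline std::vector<Unit16> makeUnits(const char* ascii) {
    std::vector<Unit16> out;
    for (; *ascii != '\0'; ++ascii) {
        out.push_back(static_cast<Unit16>(static_cast<unsigned char>(*ascii)));
    }
    return out;
}

// Count occurrences of one code unit using a standard algorithm.
inline std::size_t countUnit(const std::vector<Unit16>& s, Unit16 u) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), u));
}
```

For example, `countUnit(makeUnits("<a><b/></a>"), '<')` counts the opening angle brackets. Iterators, `std::find`, `std::copy` and similar facilities work the same way on such a vector.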


> If my implementation doesn't support a particular locale, and
> therefore does not use a 16 bit or larger data type, then what are the
> chances that I would use Xerces-C to support such a character set?

You've got it backwards -- Xerces-C only supports the current locale's
character set in a very limited fashion, by providing a way to transcode
UTF-16 strings to character strings in the current locale.  Otherwise,
it operates internally exclusively in UTF-16, and it is unaffected by
the current locale or how the system encodes char or wchar_t.
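
For reference, here is a minimal sketch of what encoding a character as UTF-16 code units means. It is independent of any locale. `Unit16` and `encodeUtf16` are hypothetical names, not Xerces-C functions, and the sketch assumes the input is a valid Unicode scalar value (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical alias for a 16-bit code unit.
typedef std::uint16_t Unit16;

// Encode one Unicode code point as one or two UTF-16 code units.
// Assumes cp is a valid scalar value; no error checking is done.
inline std::vector<Unit16> encodeUtf16(std::uint32_t cp) {
    std::vector<Unit16> out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: the code point is the code unit.
        out.push_back(static_cast<Unit16>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<Unit16>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<Unit16>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

U+0041 yields the single unit 0x0041, while U+1F600 yields the pair 0xD83D 0xDE00. This is why a 16-bit integral type suffices as the storage unit even for characters outside the BMP.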

According to the standard, the C++ implementation must use a wchar_t large enough to hold all the characters used by that locale. Combining that requirement with the requirement that the implementation needs to support the character literals of the extended character set using the naming specified by ISO/IEC 10646:2000, I conclude that the requirement is virtually identical to the requirement that it support UTF. But I won't go so far as to say UTF-16.


UTF-16 is an encoding of the 10646/Unicode character set, and you've stated previously that the C++ standard does not talk about encodings:

> The C++ Standard only specifies character sets.  It does not specify
> encodings.

There is no requirement that a character specified with a universal character name be encoded in any particular way -- it's just another way to name a character.
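
A small sketch of the distinction between naming a character and encoding it. It relies on the `u""` and `U""` literal prefixes from C++11, which postdate this discussion and are used here only as an assumption for illustration; `unitCount` is a hypothetical helper. The same universal character name produces a different number of code units depending on the literal type that receives it.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t unitCount(const CharT (&)[N]) {
    return N - 1;
}

// The same character name, two different encodings (C++11 or later):
static_assert(unitCount(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(unitCount(U"\U0001F600") == 1, "UTF-32: a single code unit");
```

For narrow and wide literals the result depends on the implementation's execution character sets, so the sketch makes no claim about them.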

My version of the standard also has this to say:

"If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed."

That restricts the usage of universal character names too severely for Xerces-C's purposes.

Dave

