Re: How do I use Xerces strings?

David Bertoni Thu, 09 Mar 2006 11:15:17 -0800

Steven T. Hatton wrote:

On Thursday 09 March 2006 12:16, David Bertoni wrote:
Steven T. Hatton wrote:
wchar_t is 32 bits on my system.  I believe that a 16 bit storage unit
will under normal circumstances occupy a 32 bit memory location, but only
use half of it.
Yes, and don't you think that's rather wasteful?  Would you use Xerces-C
to process large XML documents if you knew it was wasting half of its
character string storage just so it could use wchar_t on all platforms?
Actually, I did not state my intended meaning well, and I have now come to understand that I was in error. I was thinking in terms of individual units of storage, i.e., individual characters as opposed to containers. Containers (at least sequential containers) are basically arrays under the hood, so they do store data contiguously. I believe an individual 16-bit XMLCh will occupy 32-bits of storage, but that is probably a fairly rare animal, and therefore not worth consideration.

I guess I don't understand what you mean by "I believe an individual 16-bit XMLCh will occupy 32-bits of storage." How can a 16-bit XMLCh ever occupy 32 bits of storage?
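To make the storage question concrete, here is a minimal sketch that measures how far apart adjacent 16-bit elements sit in an array. The alias Utf16Unit is a hypothetical stand-in for XMLCh (it is not the Xerces-C typedef), and the sketch assumes a platform that provides std::uint16_t.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for XMLCh; an assumption for illustration only.
using Utf16Unit = std::uint16_t;

// std::uint16_t, where it exists, is exactly 16 bits wide with no padding.
static_assert(sizeof(Utf16Unit) * CHAR_BIT == 16,
              "Utf16Unit must be exactly 16 bits");

// Distance in bytes between two adjacent array elements.
inline std::size_t strideInBytes() {
    Utf16Unit buf[2] = {0, 0};
    const char* first = reinterpret_cast<const char*>(&buf[0]);
    const char* second = reinterpret_cast<const char*>(&buf[1]);
    assert(second > first);
    return static_cast<std::size_t>(second - first);
}

// Total bytes occupied by an array of n code units.
inline std::size_t bytesFor(std::size_t n) {
    return n * sizeof(Utf16Unit);
}
```

Array elements are laid out back to back, so strideInBytes() equals sizeof(Utf16Unit) and a buffer of n code units takes n * sizeof(Utf16Unit) bytes.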

Why does Xerces-C use a non-standard data type?
unsigned short is not a non-standard type.  You may think it's
"non-standard" for holding character data, but Xerces-C encodes
character data in UTF-16 code units, and that requires a 16-bit integral
type.
It is (AFAIK) not one of the datatypes supported by my Standard Library implementation. That is my point. I cannot seamlessly use it with the facilities provided by the C++ Standard Library.

I agree it's a big problem that you cannot use it with std::basic_string, but there's no reason why you can't use it with the other containers. What other facilities do you want to use?
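As a rough sketch of the point about other containers, a 16-bit integral type works with std::vector and the standard algorithms. The alias Utf16Unit and the helpers toVector and countUnit are hypothetical names for illustration, not part of Xerces-C.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for XMLCh; an assumption for illustration only.
using Utf16Unit = std::uint16_t;

// Copy a null-terminated buffer of code units into a std::vector.
// The terminating zero is not copied.
inline std::vector<Utf16Unit> toVector(const Utf16Unit* s) {
    assert(s != nullptr);
    const Utf16Unit* end = s;
    while (*end != 0) {
        ++end;
    }
    return std::vector<Utf16Unit>(s, end);
}

// Count how many times one code unit occurs, using std::count.
inline std::size_t countUnit(const std::vector<Utf16Unit>& v, Utf16Unit u) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), u));
}
```

Nothing here depends on character traits, which is why the containers and algorithms accept the type without extra work.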

If my implementation doesn't support a particular locale, and
therefore does not use a 16 bit or larger data type, then what are the
chances that I would use Xerces-C to support such a character set?

You've got it backwards -- Xerces-C only supports the current locale's
character set in a very limited fashion, by providing a way to transcode
UTF-16 strings to character strings in the current locale.  Otherwise,
it operates internally exclusively in UTF-16, and it is unaffected by
the current locale or how the system encodes char or wchar_t.
According to the standard, the C++ implementation must use a wchar_t large enough to hold all the characters used by that locale. Combining that requirement with the requirement that implementation needs to support the character literals of the extended character set using the naming specified by ISO/IEC 10646:2000, I conclude that the requirement is virtually identical to the requirement that it support UTF. But I won't go so far as to say UTF-16.

UTF-16 is an encoding of the 10646/Unicode character set, and you've stated previously that the C++ standard does not talk about encodings:


> The C++ Standard only specifies character sets.  It does not specify
> encodings.

There is no requirement that a character specified with a universal character name be encoded in any particular way -- it's just another way to name a character.
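To illustrate the difference between naming a character and encoding it, here is a minimal sketch of how one 10646/Unicode code point maps to UTF-16 code units. The function name encodeUtf16 is hypothetical and is not a Xerces-C API; the mapping itself follows the standard UTF-16 surrogate-pair rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
// Hypothetical helper for illustration; not part of Xerces-C.
inline std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    // Valid scalar values only: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For example, U+0041 becomes the single unit 0x0041, while U+1D11E becomes the pair 0xD834 0xDD1E. The same character name can therefore yield different storage depending on the encoding chosen.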


My version of the standard also has this to say:

"If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed."

That restricts the usage of universal character names too severely for Xerces-C's purposes.


Dave


