Sven Bauhan wrote:
This is not true.  std::string and UTF-8 are fully compatible, as long as you make no assumptions about chopping things up at arbitrary indices, or about the relationship of Unicode code points and UTF-8 code units.  At any rate, with a double-byte or multi-byte locale code page, you'd have the same issues.

I do not really understand what you want to say here. As far as I know, std::string stores strings in single-byte units. In UTF-8 the units have a variable length between 1 and 4 bytes, so I cannot see a match here. I thought that to use UTF-8 with the STL you would need something like std::basic_string<UTFChar>.
You are confusing code points and code units. The size of a code unit in UTF-8 is an octet (8 bits, or one byte on most architectures). The number of octets required to encode a particular Unicode code point in UTF-8 is 1, 2, 3, or 4. If you ignore architectures where a byte stores more than 8 bits, you can then assume that an octet and a byte are interchangeable.

UTF-8 was designed to be compatible with the char data type, and null-terminated arrays of UTF-8 code units are compatible with many C/C++ runtime functions that accept C-style strings. The problems start when you rely on locale-specific behavior, or you make assumptions about the relationship of code points and code units. For example, a substring operation could be problematic if I split a multi-byte UTF-8 sequence. Another example is code that relies on functions like isdigit, which are sensitive to the locale and/or the system default encoding for char. In that case, UTF-8 bytes might be mistakenly interpreted as code points in the system encoding.
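As a rough illustration (just a made-up, standalone snippet, not from any real project), here is where the code point / code unit distinction bites when you slice a std::string at an arbitrary byte index:

#include <cassert>
#include <string>

int main()
{
    // "café": the é (U+00E9) is encoded as the two UTF-8 code units 0xC3 0xA9.
    std::string s = "caf\xC3\xA9";

    assert(s.size() == 5);   // size() counts code units (bytes), not code points

    // Byte-oriented operations that treat the contents as opaque are safe:
    std::string drink = s + " au lait";

    // But an arbitrary index can land inside a multi-byte sequence. This
    // keeps the lead byte 0xC3 and drops its continuation byte, leaving
    // an invalid UTF-8 string.
    std::string broken = s.substr(0, 4);

    assert(broken.size() == 4);
    (void)drink;
    return 0;
}

The same goes for the locale-sensitive functions: a function like isdigit would look at those two bytes through the system's char encoding, not as the single code point U+00E9.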


Could you tell me how to transcode the XMLCh* correctly using UTF-8?
You call the transcoding service and create a UTF-8 transcoder. There is a code snippet in another ongoing thread, with the subject "Converting XMLCh* to std::string with encoding."
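For what it's worth, a Xerces-C 3.x build has the TranscodeToStr helper in TransService.hpp that wraps this up for you. Something along these lines (untested; the toUTF8 name is just for illustration, and on the older 2.x releases you would instead create a transcoder via XMLPlatformUtils::fgTransService->makeNewTranscoderFor and drive it yourself):

#include <string>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/TransService.hpp>

// Illustrative helper: convert an XMLCh* string to a UTF-8 encoded std::string.
// Assumes XMLPlatformUtils::Initialize() has already been called.
std::string toUTF8(const XMLCh* src)
{
    if (!src)
        return std::string();

    // TranscodeToStr runs the transcoding service with the named
    // target encoding and owns the transcoded byte buffer.
    xercesc::TranscodeToStr converter(src, "UTF-8");

    return std::string(reinterpret_cast<const char*>(converter.str()),
                       converter.length());
}

But see the snippet in the other thread for the full details.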

Dave
