Sven Bauhan wrote:
This is not true.  std::string and UTF-8 are fully compatible, as long as you make no assumptions about chopping things up at arbitrary indices, or about the relationship of Unicode code points and UTF-8 code units.  At any rate, with a double-byte or multi-byte locale code page, you'd have the same issues.

I do not really understand what you want to say here. As far as I know, std::string stores strings in single-byte units. In UTF-8 the units have a variable length between 1 and 4 bytes, so I cannot see a match here. I thought that to use UTF-8 with the STL you would need something like std::basic_string<UTFChar>.
You are confusing code points and code units. The size of a code unit in UTF-8 is an octet (8 bits, or one byte on most architectures). The number of octets required to encode a particular Unicode code point in UTF-8 is 1, 2, 3, or 4. If you ignore architectures where a byte stores more than 8 bits, you can then assume that an octet and a byte are interchangeable.

UTF-8 was designed to be compatible with the char data type, and null-terminated arrays of UTF-8 code units are compatible with many C/C++ runtime functions that accept C-style strings. The problems start when you rely on locale-specific behavior, or you make assumptions about the relationship of code points and code units. For example, a substring operation could be problematic if I split a multi-byte UTF-8 sequence. Another example is code that relies on functions like isdigit, which are sensitive to the locale and/or the system default encoding for char. In that case, UTF-8 bytes might be mistakenly interpreted as code points in the system encoding.
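As a rough illustration (just a made-up, standalone snippet, not from any real project), here is where the code point / code unit distinction bites when you slice a std::string at an arbitrary byte index:

#include <cassert>
#include <string>

int main()
{
    // "café": the é (U+00E9) is encoded as the two UTF-8 code units 0xC3 0xA9.
    std::string s = "caf\xC3\xA9";

    assert(s.size() == 5);   // size() counts code units (bytes), not code points

    // Byte-oriented operations that treat the contents as opaque are safe:
    std::string drink = s + " au lait";

    // But an arbitrary index can land inside a multi-byte sequence. This
    // keeps the lead byte 0xC3 and drops its continuation byte, leaving
    // an invalid UTF-8 string.
    std::string broken = s.substr(0, 4);

    assert(broken.size() == 4);
    (void)drink;
    return 0;
}

The same goes for the locale-sensitive functions: a function like isdigit would look at those two bytes through the system's char encoding, not as the single code point U+00E9.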


Could you tell me how to transcode the XMLCh* correctly using UTF-8?
You call the transcoding service and create a UTF-8 transcoder. There is a code snippet in another ongoing thread, with the subject "Converting XMLCh* to std::string with encoding."
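For what it's worth, a Xerces-C 3.x build has the TranscodeToStr helper in TransService.hpp that wraps this up for you. Something along these lines (untested; the toUTF8 name is just for illustration, and on the older 2.x releases you would instead create a transcoder via XMLPlatformUtils::fgTransService->makeNewTranscoderFor and drive it yourself):

#include <string>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/TransService.hpp>

// Illustrative helper: convert an XMLCh* string to a UTF-8 encoded std::string.
// Assumes XMLPlatformUtils::Initialize() has already been called.
std::string toUTF8(const XMLCh* src)
{
    if (!src)
        return std::string();

    // TranscodeToStr runs the transcoding service with the named
    // target encoding and owns the transcoded byte buffer.
    xercesc::TranscodeToStr converter(src, "UTF-8");

    return std::string(reinterpret_cast<const char*>(converter.str()),
                       converter.length());
}

But see the snippet in the other thread for the full details.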

Dave
