Sven Bauhan wrote:
This is not true. std::string and UTF-8 are fully compatible, as long as
you make no assumptions about chopping things up at arbitrary indices, or
the relationship of Unicode code points and UTF-8 code units. At any rate,
with a double-byte or multi-byte locale code page, you'd have the same
issues.
I do not really understand what you want to say here. As far as I know,
std::string stores strings in single-byte units. In UTF-8, the units have a
variable length of between 1 and 4 bytes, so I cannot see a match here.
I thought that to use UTF-8 with the STL, you would need something like
std::basic_string<UTFChar>.
You are confusing code points and code units. The size of a code unit in
UTF-8 is an octet (8 bits, or one byte on most architectures). The number
of octets required to encode a particular Unicode code point in UTF-8 is 1,
2, 3, or 4. If you ignore architectures where a byte stores more than 8
bits, you can then assume that an octet and a byte are interchangeable.
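To make the distinction concrete, here is a small standalone sketch (plain
standard C++, nothing Xerces-specific; the string literal is just an example
I made up) showing that std::string::size() counts UTF-8 code units, while
code points have to be counted by walking the bytes:

#include <iostream>
#include <string>

int main()
{
    // "héllo" encoded as UTF-8: 'é' (U+00E9) takes two code units
    // (0xC3 0xA9); the other four characters take one code unit each.
    const std::string s = "h\xC3\xA9llo";

    // size() counts code units (bytes), not code points.
    std::cout << "code units:  " << s.size() << '\n';   // 6

    // Counting code points: every byte that is not a UTF-8 continuation
    // byte (10xxxxxx) starts a new code point.
    std::size_t codePoints = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++codePoints;
    std::cout << "code points: " << codePoints << '\n'; // 5
}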
UTF-8 was designed to be compatible with the char data type, and
null-terminated arrays of UTF-8 code units are compatible with many C/C++
runtime functions that accept C-style strings. The problems start when you
rely on locale-specific behavior, or you make assumptions about the
relationship of code points and code units. For example, a substring
operation could be problematic if it splits a multi-byte UTF-8 sequence.
Another example is code that relies on functions like isdigit, which are
sensitive to the locale and/or the system default encoding for char. In
that case, UTF-8 bytes might be mistakenly interpreted as code points in
the system encoding.
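For illustration, a short sketch of both pitfalls (again plain standard C++,
with a made-up string literal):

#include <cctype>
#include <iostream>
#include <string>

int main()
{
    // UTF-8 for "né": 'n' is one byte, 'é' (U+00E9) is the two bytes 0xC3 0xA9.
    const std::string name = "n\xC3\xA9";

    // A byte-indexed substr() can split the two-byte sequence for 'é',
    // leaving a truncated, invalid UTF-8 string (here: 'n' plus the lone
    // lead byte 0xC3).
    const std::string broken = name.substr(0, 2);
    std::cout << "broken.size(): " << broken.size() << '\n'; // 2 bytes, not 2 characters

    // Passing raw UTF-8 bytes to locale-sensitive classifiers is also
    // misleading: isdigit() looks the byte up in the current locale's
    // single-byte classification table, not as part of a UTF-8 sequence.
    const unsigned char leadByte = 0xC3;
    std::cout << std::boolalpha
              << "isdigit(0xC3): " << (std::isdigit(leadByte) != 0) << '\n';
}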
Could you tell me how to transcode the XMLCh* correctly using UTF-8?
You call the transcoding service and create a UTF-8 transcoder. There is a
code snippet in another ongoing thread, with the subject
"Converting XMLCh* to std::string with encoding."
Dave