[snip]
The surrogate pair problem is another matter entirely. First of all, let's see if I understand the problem correctly: Some Unicode characters can be represented by one codepoint outside the BMP (i.e., not with 16 bits) and alternatively with two codepoints, both of them in the 16-bit range.
A Unicode character has a code point, which is a scalar value in the range U+0000 to U+10FFFF. The code point for every character in the Unicode character set will fall in this range.
There are Unicode encoding schemes, which specify how Unicode code point values are serialized. Examples include UTF-8, UTF-16LE, UTF-16BE, UTF-32, UTF-7, etc.
The UTF-16 (big or little endian) encoding scheme uses two code units (16-bit values), known as a surrogate pair, to encode Unicode characters with code point values above U+FFFF.
According to Marvin's explanations, the Unicode standard requires these characters to be represented as "the one" codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that character.
Since the Unicode code point range is constrained to U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.
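To make the 4-byte limit concrete, here is a minimal sketch of how the UTF-8 length follows from the code point value. The class and method names (Utf8Length, utf8Length) are made up for illustration and are not from Lucene.

```java
// Hypothetical helper, for illustration only.
final class Utf8Length {
    /** Returns the number of bytes UTF-8 uses for the given code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        // Note: U+D800...U+DFFF are surrogates, not characters; a strict
        // encoder would reject them rather than emit 3 bytes.
        if (codePoint <= 0x7F) return 1;    // 7 bits of payload
        if (codePoint <= 0x7FF) return 2;   // 11 bits of payload
        if (codePoint <= 0xFFFF) return 3;  // 16 bits of payload
        return 4;                           // 21 bits, enough for U+10FFFF
    }
}
```

Because U+10FFFF fits in 21 bits, the 4-byte form is the longest one ever needed.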
But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit range cannot be represented as chars. That is, the in-memory representation still requires the use of the surrogate pairs. Therefore, writing consists of translating the surrogate pair to the >16bit representation of the same character and then algorithmically encoding that. Reading is exactly the reverse process.
Yes. Writing requires that you combine the two surrogate characters into a Unicode code point, then convert that value into the 4-byte UTF-8 sequence.
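As a rough sketch of that write step, assuming Java 5 or later (for Character.toCodePoint and the surrogate checks): the class and method names below are hypothetical, and this is not the actual Lucene writeChars code.

```java
// Hypothetical helper, for illustration only.
final class SurrogateUtf8Writer {
    /** Encodes one surrogate pair as the 4-byte UTF-8 sequence for its code point. */
    static byte[] encodePair(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("Not a valid surrogate pair");
        }
        // Combine the pair into a code point in U+10000...U+10FFFF.
        int cp = Character.toCodePoint(high, low);
        // Spread the 21 bits over the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```

Reading would do the reverse: collect the 21 bits from the four bytes, then split the code point back into a high and a low surrogate.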
Adding code to handle the 4 to 6 byte encodings to the readChars/writeChars method is simple, but how do you do the mapping from surrogate pairs to the chars they represent? Is there an algorithm for doing that except for table lookups or huge switch statements?
It's easy, since U+D800...U+DBFF is defined as the range for the high (most significant) surrogate, and U+DC00...U+DFFF is defined as the range for the low (least significant) surrogate.
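Since each surrogate range holds 1024 values, each surrogate carries 10 bits, and the mapping is plain arithmetic with no tables. A minimal sketch, with made-up class and method names:

```java
// Hypothetical helper, for illustration only.
final class SurrogateMath {
    static boolean isHigh(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLow(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    /** Combines a high/low surrogate pair into a code point (no validation). */
    static int toCodePoint(char high, char low) {
        // High surrogate supplies the top 10 bits, low surrogate the bottom 10,
        // and the result is offset by 0x10000 (the first non-BMP code point).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /** Splits a code point in U+10000...U+10FFFF into {high, low} surrogates. */
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;
        return new char[] {
            (char) (0xD800 + (v >> 10)),
            (char) (0xDC00 + (v & 0x3FF))
        };
    }
}
```

On Java 5 and later, Character.toCodePoint and Character.toChars do the same job.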
-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200