On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
Unicode characters do not map precisely to code points: a single character can often be represented via a single codepoint or a combination of two (surrogate pair).
I normally hear surrogates in the context of UTF-16 after the code point space became too large for UTF-16 to represent. AFAIK it's more of an encoding thing, not a code point thing... for example, you would never see the surrogates if you encoded in UTF8 (although the surrogates are still code points since they needed to be reserved). But there do seem to be groups of code points that map to a single character: http://en.wikipedia.org/wiki/Combining_character
have no idea how java's String class handles this--I doubt it does any intelligent normalization.
UTF-16 surrogates are handled as of Java5. -Yonik --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]