Re: Unicode Normalization

Yonik Seeley Wed, 11 Apr 2007 20:02:55 -0700

On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:

> Unicode characters do not map
> precisely to code points:  a single character can often be represented
> via a single codepoint or a combination of two (surrogate pair).


I normally hear about surrogates in the context of UTF-16, which needed
them once the code point space grew too large for a single 16-bit unit
to represent.  AFAIK it's more of an encoding thing than a code point
thing: for example, you would never see surrogates if you encoded in
UTF-8 (although the surrogate values are still code points, since they
had to be reserved).
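To make the encoding point concrete, here is a minimal sketch that encodes one supplementary code point both ways. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) and the class and helper names are mine, purely for illustration, and StandardCharsets needs Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are hypothetical.
class SurrogateDemo {
    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes of a single code point encoded as UTF-8.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Bytes of a single code point encoded as big-endian UTF-16 (no BOM).
    static String utf16Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // outside the 16-bit range
        // UTF-16 needs a surrogate pair: D834 DD1E
        System.out.println(utf16Hex(clef)); // D8 34 DD 1E
        // UTF-8 encodes the code point directly in four bytes, no surrogates
        System.out.println(utf8Hex(clef));  // F0 9D 84 9E
    }
}
```

The surrogate values D834 and DD1E appear only in the UTF-16 output; the UTF-8 bytes are derived from the code point itself.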

But there do seem to be groups of code points that map to a single character:
http://en.wikipedia.org/wiki/Combining_character
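As a rough illustration of the combining-character case, the sketch below compares a precomposed "e with acute" (U+00E9) against "e" followed by COMBINING ACUTE ACCENT (U+0301). It assumes java.text.Normalizer, which was added in Java 6 and is not something discussed in this thread; the class and method names are made up for the example:

```java
import java.text.Normalizer;

// Illustrative only: class and method names are hypothetical.
class CombiningDemo {
    // Compare two strings after normalizing both to NFC (composed form).
    static boolean equalAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // one code point
        String decomposed = "e\u0301";  // base letter + combining accent

        // Plain String.equals compares code units, so these differ.
        System.out.println(precomposed.equals(decomposed));         // false
        // After normalization they compare equal.
        System.out.println(equalAfterNfc(precomposed, decomposed)); // true
        // NFD splits the precomposed form into two code points.
        System.out.println(
            Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2
    }
}
```

Neither string contains a surrogate; both code points fit in a single 16-bit unit, so this is a separate issue from the UTF-16 one.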

> have no idea how java's String class handles this--I doubt it does any
> intelligent normalization.


UTF-16 surrogates are handled as of Java 5, via the code point methods
on String and Character (e.g. String.codePointAt).
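For reference, a small sketch of walking a String by code point rather than by char, using methods that exist since Java 5 (String.codePointAt, String.codePointCount, Character.charCount). The class name and sample string are hypothetical, and the generics shorthand needs Java 7 or later:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: class and method names are hypothetical.
class CodePointWalk {
    // Collect the code points of s, stepping over surrogate pairs as one unit.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // combines a surrogate pair if present
            out.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (written as its surrogate pair), 'b'
        String s = "a\uD834\uDD1Eb";
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        for (int cp : codePoints(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Note that this only deals with surrogate pairs; it does nothing about combining sequences or normalization.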

-Yonik
