Are you talking about unpaired surrogates or something else?
Thanks,
Masayoshi

On 1/24/2011 5:22 AM, Tom Christiansen wrote:
I am somewhat uncertain, but I believe that Java *almost* meets this requirement.

    1.7 Code Points

    A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units.

    RL1.7 Supplementary Code Points

    To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

Java tries to make things work this way, and always does so on well-formed input.

The reason I say almost is because of the way the regex engine will sometimes match individual code units on ill-formed UTF-16 sequences. I believe this behaviour to be contrary to the fundamental requirement for Level 1 compliance that Unicode text never be interpreted as code units.

Fortunately, this does not seem too difficult to fix, though.

--tom
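
For concreteness, here is a minimal sketch of the distinction being discussed. It is not taken from the JDK sources or from the original report. The class name `SurrogateDemo` and its two helper methods are hypothetical and exist only for illustration. The sketch shows that a well-formed surrogate pair is one code point to the regex engine, and it probes one ill-formed case without asserting what the engine does there, since that may differ between Java releases.

```java
import java.util.regex.Pattern;

public class SurrogateDemo {
    // Hypothetical helper: true if s contains a surrogate code unit
    // that is not part of a well-formed leading/trailing pair.
    public static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the trailing half of a well-formed pair
                } else {
                    return true; // leading surrogate with no trailing surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // trailing surrogate with no leading surrogate
            }
        }
        return false;
    }

    // Hypothetical helper: does a single "." account for the entire string?
    public static boolean dotMatchesWhole(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        // U+1D49C, encoded in UTF-16 as a leading plus a trailing surrogate.
        String wellFormed = "\uD835\uDC9C";
        // The same leading surrogate on its own: ill-formed UTF-16.
        String loneLeading = "\uD835";

        System.out.println("code units:  " + wellFormed.length());
        System.out.println("code points: " + wellFormed.codePointCount(0, wellFormed.length()));
        System.out.println("'.' matches the whole pair: " + dotMatchesWhole(wellFormed));
        System.out.println("pair is ill-formed: " + hasUnpairedSurrogate(wellFormed));
        System.out.println("lone surrogate is ill-formed: " + hasUnpairedSurrogate(loneLeading));

        // The questionable case: a pattern that is itself a lone surrogate,
        // searched for inside a well-formed pair. No expected result is
        // claimed here; run it on the release you care about.
        boolean found = Pattern.compile(loneLeading).matcher(wellFormed).find();
        System.out.println("lone-surrogate pattern found inside pair: " + found);
    }
}
```

If the last line prints true on a given release, the engine matched a single code unit out of the middle of a code point on that release, which is the kind of behaviour described above.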