Are you talking about unpaired surrogates or something else?

Thanks,
Masayoshi

On 1/24/2011 5:22 AM, Tom Christiansen wrote:
I am somewhat uncertain, but I believe that Java
*almost* meets this requirement.

     1.7 Code Points

     A fundamental requirement is that Unicode text be interpreted
     semantically by code point, not code units.

     RL1.7      Supplementary Code Points

         To meet this requirement, an implementation shall handle the full
         range of Unicode code points, including values from U+FFFF to
         U+10FFFF. In particular, where UTF-16 is used, a sequence
         consisting of a leading surrogate followed by a trailing surrogate
         shall be handled as a single code point in matching.

Java tries to make things work this way, and always does so on well-formed
input.  I say "almost" because the regex engine will sometimes match
individual code units in ill-formed UTF-16 sequences.  I believe this
behaviour to be contrary to the fundamental requirement for Level 1
compliance that Unicode text never be interpreted as code units.
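
To make the distinction concrete, here is a minimal sketch of the two cases.
The class name SurrogateDemo and the helper countMatches are hypothetical,
chosen only for illustration, and the sample character U+1D504 is an arbitrary
supplementary code point.  It shows only the simplest situation, a lone
leading surrogate as the entire input, and does not try to reproduce every
ill-formed case the engine may mishandle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class SurrogateDemo {

    // Counts how many times the regex finds a match in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Well-formed: U+1D504 encoded as a leading + trailing surrogate.
        String wellFormed = "\uD835\uDD04";
        // Ill-formed: a leading surrogate with no trailing surrogate.
        String illFormed = "\uD835";

        // Two UTF-16 code units, but one code point.
        System.out.println(wellFormed.length());                              // 2
        System.out.println(wellFormed.codePointCount(0, wellFormed.length())); // 1

        // "." consumes the whole surrogate pair as a single code point.
        System.out.println(countMatches(".", wellFormed));                    // 1

        // On the unpaired surrogate, "." matches the lone code unit.
        System.out.println(countMatches(".", illFormed));                     // 1
    }
}
```

The first match is what RL1.7 asks for.  The last one is the kind of match on
a bare code unit that I am questioning above.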

Fortunately, this does not seem too difficult to fix.

--tom
