The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.
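To make the code-unit-to-code-point step concrete, here is a minimal sketch of the rule quoted further down ("A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value"). The helper name codePointAtIndex is hypothetical and exists only for illustration; it is not part of the proposal text or of any existing API.

```javascript
// Hypothetical helper (illustration only): return the code point that starts
// at code unit index i of a UTF-16 string s.
function codePointAtIndex(s, i) {
  var first = s.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.length) {
    var second = s.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      // Well-formed surrogate pair: combine into one supplementary code point.
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
    }
  }
  // BMP code unit, or an unpaired surrogate: the code point has the same value.
  return first;
}

var fromPair = codePointAtIndex("\uD834\uDD1E", 0);  // 0x1D11E
var fromLoneHigh = codePointAtIndex("\uD800x", 0);   // 0xD800
var fromLoneLow = codePointAtIndex("\uDC00\uD800", 0); // 0xDC00
```

Nothing in this step assigns any meaning to the resulting value, which is why C1 has to be checked against what happens to the code point afterwards.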
My proposal interprets the resulting code points in the following ways:

1) In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category.

2) When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD.

3) In case conversion, they won't have upper case or lower case equivalents defined, so they remain as is, as would happen for unassigned code points or U+FFFD.

I don't think any of these amounts to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.

Norbert


On Mar 26, 2012, at 15:10 , Glenn Adams wrote:

> On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <[email protected]>
> wrote:
> I really like the direction you're going in, but have one minor concern
> relating to regular expressions.
>
> In your proposal, you currently state:
> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
> a surrogate pair, is interpreted as a code point with the same value."
>
> Just as a reminder, this would be in explicit violation of the Unicode
> conformance clause C1 unless it can be guaranteed that such a code point will
> not be interpreted as an abstract character:
>
> C1 A process shall not interpret a high-surrogate code point or a
> low-surrogate code point as an abstract character.
>
> [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>
> Given that such guarantee is likely impractical, this presents a problem for
> the above proposed language.
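
The three interpretations listed in the reply above can be sketched as follows. This is only an illustration under assumptions: it relies on the regular expression "u" flag and \p{...} property escapes as they were later standardized (ES2015 and ES2018), neither of which exists in ES5, and it is not claimed to match the proposal's exact wording.

```javascript
var lone = "\uD800"; // an unpaired high surrogate

// 1) Regular expressions: usable in patterns and in input strings.
var matchesItself = /^\uD800$/u.test(lone);
// In "u" mode a well-formed pair is one code point (U+10000), so the
// lone surrogate in the pattern does not match half of it.
var matchesInsidePair = /\uD800/u.test("\uD800\uDC00");
// For most properties it falls into the bucket of have-nots...
var isAlphabetic = /\p{Alphabetic}/u.test(lone);
// ...unless you ask for its general category.
var isSurrogateCategory = /\p{General_Category=Surrogate}/u.test(lone);

// 2) Identifiers: neither ID_Start nor ID_Continue, same as U+FFFD.
var isIdStart = /\p{ID_Start}/u.test(lone);
var isIdContinue = /\p{ID_Continue}/u.test(lone);
var replacementIsIdStart = /\p{ID_Start}/u.test("\uFFFD");

// 3) Case conversion: no case mappings, so the string is unchanged.
var upperUnchanged = lone.toUpperCase() === lone;
var lowerUnchanged = lone.toLowerCase() === lone;
```

In each case the unpaired surrogate is treated the way a code point without the relevant property or mapping would be, which is the sense in which none of these uses interprets it as an abstract character.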

