The conformance clause doesn't say anything about the interpretation of 
(UTF-16) code units as code points. To check conformance with C1, you have to 
look at how the resulting code points are actually further interpreted.

My proposal interprets the resulting code points in the following ways:

1) In regular expressions, they can be used in both patterns and input strings 
to be matched. They may be compared against other code points, or against 
character classes, some of which will hopefully soon be defined by Unicode 
properties. In the case of comparing against other code points, they can't 
match any code points assigned to abstract characters. In the case of Unicode 
properties, they'll typically fall into the large bucket of have-nots, along 
with other unassigned code points or, for example, U+FFFD, unless you ask for 
their general category.

2) When parsing identifiers, they will not have the ID_Start or ID_Continue 
properties, so they'll be excluded, just like other unassigned code points or 
U+FFFD.

3) In case conversion, they won't have upper case or lower case equivalents 
defined, and remain as is, as would happen for unassigned code points or U+FFFD.

I don't think any of these amounts to interpretation as abstract characters. 
I mention U+FFFD because the alternative interpretation of unpaired surrogates 
would be to replace them with U+FFFD, but that doesn't seem to improve anything.
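
As a rough illustration only, here is a TypeScript sketch of the interpretation 
described above. The helper names (toCodePoints, isSurrogate, 
replaceLoneSurrogates) are made up for this example and are not part of the 
proposal or of any existing API. It decodes code units into code points, keeps 
an unpaired surrogate as a code point with the same value, and shows the U+FFFD 
alternative for comparison.

```typescript
// Hypothetical helper: decode UTF-16 code units into code points.
// A well-formed surrogate pair becomes one supplementary code point;
// an unpaired surrogate is kept as a code point with the same value.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out.push((unit - 0xd800) * 0x400 + (next - 0xdc00) + 0x10000);
        i++; // consume the low surrogate as well
        continue;
      }
    }
    out.push(unit);
  }
  return out;
}

// Hypothetical helper: true for code points in the surrogate range.
function isSurrogate(cp: number): boolean {
  return cp >= 0xd800 && cp <= 0xdfff;
}

// The alternative interpretation: replace unpaired surrogates with U+FFFD.
function replaceLoneSurrogates(cps: number[]): number[] {
  return cps.map((cp) => (isSurrogate(cp) ? 0xfffd : cp));
}

// "a", U+1D11E (as a surrogate pair), an unpaired high surrogate, "b"
const cps = toCodePoints("a\uD834\uDD1E\uD800b");
// cps is [0x61, 0x1D11E, 0xD800, 0x62]

// Case conversion leaves an unpaired surrogate as is.
const unchanged = "\uD800".toUpperCase() === "\uD800"; // true
```

In this sketch the unpaired surrogate is only ever compared by value or passed 
through unchanged, in the same way an unassigned code point would be.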

Norbert



On Mar 26, 2012, at 15:10, Glenn Adams wrote:

> On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough wrote:
> I really like the direction you're going in, but have one minor concern 
> relating to regular expressions.
> 
> In your proposal, you currently state:
>        "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
> a surrogate pair, is interpreted as a code point with the same value."
> 
> Just as a reminder, this would be in explicit violation of the Unicode 
> conformance clause C1 unless it can be guaranteed that such a code point will 
> not be interpreted as an abstract character:
> 
> C1    A process shall not interpret a high-surrogate code point or a 
> low-surrogate code point as an abstract character.
> 
> [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf 
> 
> Given that such guarantee is likely impractical, this presents a problem for 
> the above proposed language.
