On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <[email protected]> wrote:

> The conformance clause doesn't say anything about the interpretation of
> (UTF-16) code units as code points. To check conformance with C1, you have
> to look at how the resulting code points are actually further interpreted.

True, but if the proposed language "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value." is adopted, will this not have the effect of creating unpaired surrogates as code points? If so, then by my estimation, this *will* increase the likelihood of their being interpreted as abstract characters. For example, if an unpaired code unit is interpreted as an unpaired surrogate code point, and some process or function performs *any* predicate or transform on that code point, then that amounts to interpreting it as an abstract character.

I would rather see such an unpaired code unit either (1) mapped to U+FFFD, or (2) cause an exception to be raised when an operation requires conversion of the UTF-16 code unit sequence.

> My proposal interprets the resulting code points in the following ways:
>
> 1) In regular expressions, they can be used in both patterns and input
> strings to be matched. They may be compared against other code points, or
> against character classes, some of which will hopefully soon be defined by
> Unicode properties. In the case of comparing against other code points,
> they can't match any code points assigned to abstract characters. In the
> case of Unicode properties, they'll typically fall into the large bucket of
> have-nots, along with other unassigned code points or, for example, U+FFFD,
> unless you ask for their general category.
>
> 2) When parsing identifiers, they will not have the ID_Start or
> ID_Continue properties, so they'll be excluded, just like other unassigned
> code points or U+FFFD.
>
> 3) In case conversion, they won't have upper case or lower case
> equivalents defined, and remain as is, as would happen for unassigned code
> points or U+FFFD.
>
> I don't think either of these amount to interpretation as abstract
> characters. I mention U+FFFD because the alternative interpretation of
> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
> seem to improve anything.
>
> Norbert
>
> On Mar 26, 2012, at 15:10, Glenn Adams wrote:
>
> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <[email protected]> wrote:
> > I really like the direction you're going in, but have one minor concern
> > relating to regular expressions.
> >
> > In your proposal, you currently state:
> > "A code unit that is in the range 0xD800 to 0xDFFF, but is not
> > part of a surrogate pair, is interpreted as a code point with the same
> > value."
> >
> > Just as a reminder, this would be in explicit violation of the Unicode
> > conformance clause C1 unless it can be guaranteed that such a code point
> > will not be interpreted as an abstract character:
> >
> > C1 A process shall not interpret a high-surrogate code point or a
> > low-surrogate code point as an abstract character.
> >
> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
> >
> > Given that such guarantee is likely impractical, this presents a problem
> > for the above proposed language.

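For illustration only, the three treatments of an unpaired surrogate code unit discussed in this thread (keep it as a code point with the same value, map it to U+FFFD, or raise an exception) could be sketched roughly as follows. The names decodeUtf16 and LoneSurrogatePolicy are hypothetical; they come neither from the proposal text nor from any ECMAScript API.

```typescript
// Hypothetical sketch: decode UTF-16 code units into code points under
// one of three policies for unpaired surrogates.
type LoneSurrogatePolicy = "keep" | "replace" | "throw";

function decodeUtf16(units: number[], policy: LoneSurrogatePolicy): number[] {
  const codePoints: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;

    // A high surrogate followed by a low surrogate forms a pair.
    if (isHigh && i + 1 < units.length) {
      const next = units[i + 1];
      if (next >= 0xdc00 && next <= 0xdfff) {
        codePoints.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++; // consume the low surrogate as well
        continue;
      }
    }

    // Any surrogate that reaches this point is unpaired.
    if (isHigh || isLow) {
      if (policy === "throw") {
        throw new RangeError("unpaired surrogate at index " + i);
      }
      // "keep": code point with the same value; "replace": U+FFFD.
      codePoints.push(policy === "replace" ? 0xfffd : unit);
      continue;
    }

    codePoints.push(unit);
  }
  return codePoints;
}

// "a", a valid pair (U+1F600), a lone high surrogate, "b"
const sample = [0x61, 0xd83d, 0xde00, 0xd800, 0x62];
console.log(decodeUtf16(sample, "keep"));    // lone 0xD800 stays 0xD800
console.log(decodeUtf16(sample, "replace")); // lone 0xD800 becomes 0xFFFD
```

Under "keep", later stages (regular expression matching, identifier parsing, case conversion) see a surrogate code point and must decide how to treat it; under "replace" or "throw", they never see one.
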
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

