The point of C1 is that you can't interpret the surrogate code point U+DC00 as a *character*, like an "a".
Neither can you interpret the reserved code point U+0378 as a *character*, like a "b".

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Tue, Mar 27, 2012 at 08:56, Glenn Adams <[email protected]> wrote:

> This begs the question of what is the point of C1.
>
> On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
>
>> That would not be practical, nor predictable. And note that the 700K
>> reserved code points are also not to be interpreted as characters; by your
>> logic all of them would need to be converted to FFFD.
>>
>> And in practice, an unpaired surrogate is best treated just like a
>> reserved (unassigned) code point. For example, a lowercase operation should
>> convert characters with lowercase correspondants to those correspondants,
>> and leave *everything* else alone: control characters, format characters,
>> reserved code points, surrogates, etc.
>>
>> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>>
>>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>>
>>>> That, as Norbert explained, is not the intention of the standard. Take
>>>> a look at the discussion of "Unicode 16-bit string" in chapter 3. The
>>>> committee recognized that fragments may be formed when working with UTF-16,
>>>> and that destructive changes may do more harm than good.
>>>>
>>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>>
>>>> After this operation is done, you want y == a, even if 5 is between
>>>> D800 and DC00.
>>>
>>> Assuming that b.length() == 1 in this example, my interpretation of this
>>> is that '=', '+', and 'substring' are operations whose domain and co-domain
>>> are (currently defined) ES Strings, namely sequences of UTF-16 code units.
>>> Since none of these operations entail interpreting the semantics of a code
>>> point (i.e., interpreting abstract characters), then there is no violation
>>> of C1 here.
>>>
>>> Or take:
>>>> output = "";
>>>> for (int i = 0; i < s.length(); ++i) {
>>>>     ch = s.charAt(i);
>>>>     if (ch.equals('&')) {
>>>>         ch = '@';
>>>>     }
>>>>     output += ch;
>>>> }
>>>>
>>>> After this operation is done, you want "a&\u{10000}b" to become
>>>> "a@\u{10000}b",
>>>> not "a&\u{FFFD}\u{FFFD}b".
>>>> It is also an unnecessary burden on lower-level software to always
>>>> check this stuff.
>>>
>>> Again, in this example, I assume that the string literal "a&\u{10000}b"
>>> maps to the UTF-16 code unit sequence:
>>>
>>> 0061 0026 D800 DC00 0062
>>>
>>> Given that 'charAt(i)' is defined on (and is indexing) code units and
>>> not code points, and since the 'equals' operator is also defined on code
>>> units, this example also does not require interpreting the semantics of
>>> code points (i.e., interpreting abstract characters).
>>>
>>> However, in Norbert's questions above about isUUppercase(int) and
>>> toUpperCase(int), it is clear that the domain of these operations are code
>>> points, not code units, and further, that they requiring interpretation as
>>> abstract characters in order to determine the semantics of the
>>> corresponding characters.
>>>
>>> My conclusion is that the determination of whether C1 is violated or not
>>> depends upon the domain, codomain, and operation being considered.
>>>
>>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
>>>> output, then you do need to either convert to FFFD or take some other
>>>> action.
>>>>
>>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>>
>>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> The conformance clause doesn't say anything about the interpretation
>>>>>> of (UTF-16) code units as code points. To check conformance with C1, you
>>>>>> have to look at how the resulting code points are actually further
>>>>>> interpreted.
>>>>>
>>>>> True, but if the proposed language
>>>>>
>>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
>>>>> a surrogate pair, is interpreted as a code point with the same value."
>>>>>
>>>>> is adopted, then will not this have an effect of creating unpaired
>>>>> surrogates as code points? If so, then by my estimation, this *will*
>>>>> increase the likelihood of their being interpreted as abstract
>>>>> characters... e.g., if the unpaired code unit is interpreted as a
>>>>> unpaired surrogate code point, and some process/function performs *any*
>>>>> predicate or transform on that code point, then that amounts to
>>>>> interpreting it as an abstract character.
>>>>>
>>>>> I would rather see such unpaired code unit either (1) be mapped to
>>>>> U+00FFFD, or (2) an exception raised when performing an operation that
>>>>> requires conversion of the UTF-16 code unit sequence.
>>>>>
>>>>>> My proposal interprets the resulting code points in the following
>>>>>> ways:
>>>>>>
>>>>>> 1) In regular expressions, they can be used in both patterns and
>>>>>> input strings to be matched. They may be compared against other code
>>>>>> points, or against character classes, some of which will hopefully soon
>>>>>> be defined by Unicode properties. In the case of comparing against other
>>>>>> code points, they can't match any code points assigned to abstract
>>>>>> characters. In the case of Unicode properties, they'll typically fall
>>>>>> into the large bucket of have-nots, along with other unassigned code
>>>>>> points or, for example, U+FFFD, unless you ask for their general
>>>>>> category.
>>>>>>
>>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>>> ID_Continue properties, so they'll be excluded, just like other
>>>>>> unassigned code points or U+FFFD.
>>>>>>
>>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>>> equivalents defined, and remain as is, as would happen for unassigned
>>>>>> code points or U+FFFD.
>>>>>>
>>>>>> I don't think either of these amount to interpretation as abstract
>>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>>> unpaired surrogates would be to replace them with U+FFFD, but that
>>>>>> doesn't seem to improve anything.
>>>>>>
>>>>>> Norbert
>>>>>>
>>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>>
>>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>>>>>> > <[email protected]> wrote:
>>>>>> > I really like the direction you're going in, but have one minor
>>>>>> > concern relating to regular expressions.
>>>>>> >
>>>>>> > In your proposal, you currently state:
>>>>>> > "A code unit that is in the range 0xD800 to 0xDFFF, but is
>>>>>> > not part of a surrogate pair, is interpreted as a code point with the
>>>>>> > same value."
>>>>>> >
>>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such a
>>>>>> > code point will not be interpreted as an abstract character:
>>>>>> >
>>>>>> > C1 A process shall not interpret a high-surrogate code point or
>>>>>> > a low-surrogate code point as an abstract character.
>>>>>> >
>>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>>> >
>>>>>> > Given that such guarantee is likely impractical, this presents a
>>>>>> > problem for the above proposed language.
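
To make the examples quoted above concrete, here is a minimal Java sketch of the three cases discussed in the thread: the insert-then-remove round trip, the '&' replacement loop, and a lowercase operation that leaves surrogates alone. The class and method names (SurrogateSketch, insertThenRemove, replaceAmpersand, toCodePoints, lowerCase) are hypothetical and chosen only for illustration, and toCodePoints is just one possible reading of the proposed language about unpaired code units, not the proposal's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names come from the thread.
final class SurrogateSketch {

    // Insert b at code unit index `at`, then cut it out again.
    // Only code units are copied, so a split surrogate pair is rejoined intact.
    static String insertThenRemove(String a, String b, int at) {
        String x = a.substring(0, at) + b + a.substring(at);
        return x.substring(0, at) + x.substring(at + b.length());
    }

    // Replace '&' with '@' one code unit at a time; surrogates pass through.
    static String replaceAmpersand(String s) {
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            char ch = s.charAt(i);
            output.append(ch == '&' ? '@' : ch);
        }
        return output.toString();
    }

    // Read code points from UTF-16 code units. A well-formed pair becomes one
    // supplementary code point; an unpaired surrogate becomes a code point
    // with the same value (assumed reading of the proposed language).
    static List<Integer> toCodePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char unit = s.charAt(i);
            if (Character.isHighSurrogate(unit) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.add(Character.toCodePoint(unit, s.charAt(i + 1)));
                i += 2;
            } else {
                out.add((int) unit);
                i += 1;
            }
        }
        return out;
    }

    // Lowercase by code point. Code points without a lowercase mapping,
    // including unpaired surrogates, are copied through unchanged.
    static String lowerCase(String s) {
        StringBuilder out = new StringBuilder();
        for (int cp : toCodePoints(s)) {
            out.appendCodePoint(Character.toLowerCase(cp));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "abcd\uD800\uDC00e"; // index 5 falls between D800 and DC00
        System.out.println(insertThenRemove(a, "x", 5).equals(a));
        System.out.println(replaceAmpersand("a&\uD800\uDC00b").equals("a@\uD800\uDC00b"));
        System.out.println(lowerCase("A\uDC00B").equals("a\uDC00b"));
    }
}
```

None of the four methods ever asks what character a surrogate stands for. The first two only copy and compare code units, and the last two pass anything without a lowercase mapping through unchanged, which is how a reserved code point such as U+0378 would be handled as well.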
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

