ok, i'll accept your position at this point and drop my comment; i suppose it is true that if there are already unpaired surrogates in user data as UTF-16, then having unpaired surrogates as code points is no worse;
however, it would be useful if the spec under consideration included an informative pointer to a UTC-sanctioned list of operations that constitute "interpreting as abstract characters" and that, if applied to such data, could violate C1; to this end, it would also be useful if C1 itself included a concrete example of such an operation

On Tue, Mar 27, 2012 at 2:02 PM, Mark Davis ☕ <[email protected]> wrote:

> > performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?
>
> No.
>
> > but where does one draw the line?
>
> The line is already drawn by the Unicode Consortium, by consulting the Unicode Character Database properties. If you look at the data in the Unicode Character Database for any particular property, say Alphabetic, you'll find that surrogate code points are not included where the property is a true character property. There are a few special cases where reserved code points are provisionally given "anticipatory" character properties, such as in bidi ranges, simply because that makes implementations more forward compatible, but there aren't any cases where a "character" property applies to a surrogate code point (other than by returning "No", or "n/a", or some such).
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
> *— Il meglio è l’inimico del bene —*
>
> On Tue, Mar 27, 2012 at 12:07, Glenn Adams <[email protected]> wrote:
>
>> So, if as a result of a policy of converting any UTF-16 code unit sequence to a code point sequence one ends up with an unpaired surrogate, e.g., "\u{00DC00}", then performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?
>>
>> I can see that D20 defines code point properties which would not entail interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter, but where does one draw the line?
>>
>> On Tue, Mar 27, 2012 at 11:15 AM, Mark Davis ☕ <[email protected]> wrote:
>>
>>> The point of C1 is that you can't interpret the surrogate code point U+DC00 as a *character*, like an "a".
>>>
>>> Neither can you interpret the reserved code point U+0378 as a *character*, like a "b".
>>>
>>> On Tue, Mar 27, 2012 at 08:56, Glenn Adams <[email protected]> wrote:
>>>
>>>> This begs the question of what is the point of C1.
>>>>
>>>> On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
>>>>
>>>>> That would not be practical, nor predictable. And note that the 700K reserved code points are also not to be interpreted as characters; by your logic all of them would need to be converted to FFFD.
>>>>>
>>>>> And in practice, an unpaired surrogate is best treated just like a reserved (unassigned) code point. For example, a lowercase operation should convert characters with lowercase correspondents to those correspondents, and leave *everything* else alone: control characters, format characters, reserved code points, surrogates, etc.
>>>>>
>>>>> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>>>>>
>>>>>>> That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of "Unicode 16-bit string" in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.
>>>>>>>
>>>>>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>>>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>>>>>
>>>>>>> After this operation is done, you want y == a, even if 5 is between D800 and DC00.
>>>>>>
>>>>>> Assuming that b.length() == 1 in this example, my interpretation of this is that '=', '+', and 'substring' are operations whose domain and co-domain are (currently defined) ES Strings, namely sequences of UTF-16 code units. Since none of these operations entail interpreting the semantics of a code point (i.e., interpreting abstract characters), then there is no violation of C1 here.
>>>>>>
>>>>>>> Or take:
>>>>>>>
>>>>>>> output = "";
>>>>>>> for (int i = 0; i < s.length(); ++i) {
>>>>>>>   ch = s.charAt(i);
>>>>>>>   if (ch.equals('&')) {
>>>>>>>     ch = '@';
>>>>>>>   }
>>>>>>>   output += ch;
>>>>>>> }
>>>>>>>
>>>>>>> After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b". It is also an unnecessary burden on lower-level software to always check this stuff.
>>>>>>
>>>>>> Again, in this example, I assume that the string literal "a&\u{10000}b" maps to the UTF-16 code unit sequence:
>>>>>>
>>>>>> 0061 0026 D800 DC00 0062
>>>>>>
>>>>>> Given that 'charAt(i)' is defined on (and is indexing) code units and not code points, and since the 'equals' operator is also defined on code units, this example also does not require interpreting the semantics of code points (i.e., interpreting abstract characters).
>>>>>>
>>>>>> However, in Norbert's questions above about isUpperCase(int) and toUpperCase(int), it is clear that the domain of these operations is code points, not code units, and further, that they require interpretation as abstract characters in order to determine the semantics of the corresponding characters.
>>>>>>
>>>>>> My conclusion is that the determination of whether C1 is violated or not depends upon the domain, codomain, and operation being considered.
>>>>>>
>>>>>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or output, then you do need to either convert to FFFD or take some other action.
>>>>>>>
>>>>>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.
>>>>>>>>
>>>>>>>> True, but if the proposed language
>>>>>>>>
>>>>>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."
>>>>>>>>
>>>>>>>> is adopted, then will not this have an effect of creating unpaired surrogates as code points? If so, then by my estimation, this *will* increase the likelihood of their being interpreted as abstract characters... e.g., if the unpaired code unit is interpreted as an unpaired surrogate code point, and some process/function performs *any* predicate or transform on that code point, then that amounts to interpreting it as an abstract character.
>>>>>>>>
>>>>>>>> I would rather see such unpaired code unit either (1) be mapped to U+00FFFD, or (2) an exception raised when performing an operation that requires conversion of the UTF-16 code unit sequence.
>>>>>>>>
>>>>>>>>> My proposal interprets the resulting code points in the following ways:
>>>>>>>>>
>>>>>>>>> 1) In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category.
>>>>>>>>>
>>>>>>>>> 2) When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD.
>>>>>>>>>
>>>>>>>>> 3) In case conversion, they won't have upper case or lower case equivalents defined, and remain as is, as would happen for unassigned code points or U+FFFD.
>>>>>>>>>
>>>>>>>>> I don't think any of these amounts to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.
>>>>>>>>>
>>>>>>>>> Norbert
>>>>>>>>>
>>>>>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>>>>>
>>>>>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <[email protected]> wrote:
>>>>>>>>> > I really like the direction you're going in, but have one minor concern relating to regular expressions.
>>>>>>>>> >
>>>>>>>>> > In your proposal, you currently state:
>>>>>>>>> > "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."
>>>>>>>>> >
>>>>>>>>> > Just as a reminder, this would be in explicit violation of the Unicode conformance clause C1 unless it can be guaranteed that such a code point will not be interpreted as an abstract character:
>>>>>>>>> >
>>>>>>>>> > C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
>>>>>>>>> >
>>>>>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>>>>>> >
>>>>>>>>> > Given that such guarantee is likely impractical, this presents a problem for the above proposed language.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> es-discuss mailing list
>>>>>>>> [email protected]
>>>>>>>> https://mail.mozilla.org/listinfo/es-discuss
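
for concreteness, here is a minimal sketch of the distinction discussed in this thread, written in Java only because the quoted snippets are Java-like. It assumes the "unpaired surrogate becomes a code point with the same value" policy and shows what a few code point predicates and transforms report for such a value, next to the alternative of mapping to U+FFFD. The class and method names (UnpairedSurrogateSketch, toCodePoints, replaceUnpaired) are hypothetical and not taken from any spec or proposal; the java.lang.Character calls are real, but their results reflect Java's Unicode data, not a UTC ruling on C1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names are not from the proposal or the Unicode Standard.
final class UnpairedSurrogateSketch {

    // Policy from the quoted proposal: a surrogate code unit that is not part
    // of a pair is interpreted as a code point with the same value.
    // String.codePointAt already behaves this way for unpaired surrogates.
    static List<Integer> toCodePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 2 for a well-formed pair, else 1
        }
        return out;
    }

    // Alternative policy discussed in the thread: map each unpaired
    // surrogate to U+FFFD (a destructive change).
    static String replaceUnpaired(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : toCodePoints(s)) {
            // Any surrogate value that survives toCodePoints was unpaired.
            boolean unpaired = cp >= 0xD800 && cp <= 0xDFFF;
            sb.appendCodePoint(unpaired ? 0xFFFD : cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uDC00b"; // unpaired low surrogate between 'a' and 'b'
        int cp = toCodePoints(s).get(1);

        System.out.println(Integer.toHexString(cp));                   // dc00
        System.out.println(Character.getType(cp) == Character.SURROGATE); // true
        System.out.println(Character.isLetter(cp));                    // false
        System.out.println(Character.toLowerCase(cp) == cp);           // true
        System.out.println(replaceUnpaired(s).equals("a\uFFFDb"));     // true
    }
}
```

in this sketch the predicates simply answer "no" and the case mapping leaves the value alone, which matches the "treat it like a reserved code point" behaviour described in the quoted messages; whether a given operation counts as "interpreting as an abstract character" is exactly the question a sanctioned list would settle.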
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

