This raises the question of what the point of C1 is.

On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
> That would not be practical, nor predictable. And note that the 700K
> reserved code points are also not to be interpreted as characters; by your
> logic all of them would need to be converted to FFFD.
>
> And in practice, an unpaired surrogate is best treated just like a
> reserved (unassigned) code point. For example, a lowercase operation should
> convert characters with lowercase correspondents to those correspondents,
> and leave *everything* else alone: control characters, format characters,
> reserved code points, surrogates, etc.
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
>
> *— Il meglio è l’inimico del bene —*
>
> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>
>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>
>>> That, as Norbert explained, is not the intention of the standard. Take a
>>> look at the discussion of "Unicode 16-bit string" in chapter 3. The
>>> committee recognized that fragments may be formed when working with UTF-16,
>>> and that destructive changes may do more harm than good.
>>>
>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>
>>> After this operation is done, you want y == a, even if 5 is between D800
>>> and DC00.
>>
>> Assuming that b.length() == 1 in this example, my interpretation of this
>> is that '=', '+', and 'substring' are operations whose domain and co-domain
>> are (currently defined) ES Strings, namely sequences of UTF-16 code units.
>> Since none of these operations entail interpreting the semantics of a code
>> point (i.e., interpreting abstract characters), then there is no violation
>> of C1 here.
>>
>>> Or take:
>>>
>>> output = "";
>>> for (int i = 0; i < s.length(); ++i) {
>>>   ch = s.charAt(i);
>>>   if (ch.equals('&')) {
>>>     ch = '@';
>>>   }
>>>   output += ch;
>>> }
>>>
>>> After this operation is done, you want "a&\u{10000}b" to become
>>> "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b".
>>> It is also an unnecessary burden on lower-level software to always check
>>> this stuff.
>>
>> Again, in this example, I assume that the string literal "a&\u{10000}b"
>> maps to the UTF-16 code unit sequence:
>>
>> 0061 0026 D800 DC00 0062
>>
>> Given that 'charAt(i)' is defined on (and is indexing) code units and not
>> code points, and since the 'equals' operator is also defined on code units,
>> this example also does not require interpreting the semantics of code
>> points (i.e., interpreting abstract characters).
>>
>> However, in Norbert's questions above about isUUppercase(int) and
>> toUpperCase(int), it is clear that the domain of these operations is code
>> points, not code units, and further, that they require interpretation as
>> abstract characters in order to determine the semantics of the
>> corresponding characters.
>>
>> My conclusion is that the determination of whether C1 is violated or not
>> depends upon the domain, codomain, and operation being considered.
>>
>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
>>> output, then you do need to either convert to FFFD or take some other
>>> action.
>>>
>>> ------------------------------
>>> Mark <https://plus.google.com/114199149796022210033>
>>>
>>> *— Il meglio è l’inimico del bene —*
>>>
>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>
>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <[email protected]> wrote:
>>>>
>>>>> The conformance clause doesn't say anything about the interpretation
>>>>> of (UTF-16) code units as code points.
>>>>> To check conformance with C1, you have to look at how the resulting
>>>>> code points are actually further interpreted.
>>>>
>>>> True, but if the proposed language
>>>>
>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
>>>> a surrogate pair, is interpreted as a code point with the same value."
>>>>
>>>> is adopted, then will not this have an effect of creating unpaired
>>>> surrogates as code points? If so, then by my estimation, this *will*
>>>> increase the likelihood of their being interpreted as abstract
>>>> characters... e.g., if the unpaired code unit is interpreted as an
>>>> unpaired surrogate code point, and some process/function performs *any*
>>>> predicate or transform on that code point, then that amounts to
>>>> interpreting it as an abstract character.
>>>>
>>>> I would rather see such an unpaired code unit either (1) be mapped to
>>>> U+00FFFD, or (2) have an exception raised when performing an operation
>>>> that requires conversion of the UTF-16 code unit sequence.
>>>>
>>>>> My proposal interprets the resulting code points in the following ways:
>>>>>
>>>>> 1) In regular expressions, they can be used in both patterns and input
>>>>> strings to be matched. They may be compared against other code points, or
>>>>> against character classes, some of which will hopefully soon be defined by
>>>>> Unicode properties. In the case of comparing against other code points,
>>>>> they can't match any code points assigned to abstract characters. In the
>>>>> case of Unicode properties, they'll typically fall into the large bucket
>>>>> of have-nots, along with other unassigned code points or, for example,
>>>>> U+FFFD, unless you ask for their general category.
>>>>>
>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>> ID_Continue properties, so they'll be excluded, just like other unassigned
>>>>> code points or U+FFFD.
>>>>>
>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>> equivalents defined, and remain as is, as would happen for unassigned code
>>>>> points or U+FFFD.
>>>>>
>>>>> I don't think any of these amount to interpretation as abstract
>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
>>>>> seem to improve anything.
>>>>>
>>>>> Norbert
>>>>>
>>>>> On Mar 26, 2012, at 15:10, Glenn Adams wrote:
>>>>>
>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <[email protected]> wrote:
>>>>> > I really like the direction you're going in, but have one minor
>>>>> > concern relating to regular expressions.
>>>>> >
>>>>> > In your proposal, you currently state:
>>>>> > "A code unit that is in the range 0xD800 to 0xDFFF, but is
>>>>> > not part of a surrogate pair, is interpreted as a code point with the
>>>>> > same value."
>>>>> >
>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such a
>>>>> > code point will not be interpreted as an abstract character:
>>>>> >
>>>>> > C1 A process shall not interpret a high-surrogate code point or a
>>>>> > low-surrogate code point as an abstract character.
>>>>> >
>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>> >
>>>>> > Given that such guarantee is likely impractical, this presents a
>>>>> > problem for the above proposed language.
>>>>
>>>> _______________________________________________
>>>> es-discuss mailing list
>>>> [email protected]
>>>> https://mail.mozilla.org/listinfo/es-discuss
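
For reference, the two quoted snippets and the lowercase case can be tried as plain code-unit operations in JavaScript. This is only a minimal sketch, assuming b is a single code unit and that the split index falls between D800 and DC00; the function names insertThenRemove and replaceAmpersand are made up for illustration and do not come from any proposal in this thread.

```javascript
// Hypothetical helpers: strings are handled purely as sequences of
// UTF-16 code units, with no interpretation as abstract characters.

// Insert a single code unit b at index i, then remove it again.
function insertThenRemove(a, b, i) {
  const x = a.substring(0, i) + b + a.substring(i, a.length);
  return x.substring(0, i) + x.substring(i + 1, x.length);
}

// Replace '&' with '@', one code unit at a time.
function replaceAmpersand(s) {
  let output = "";
  for (let i = 0; i < s.length; ++i) {
    let ch = s.charAt(i);
    if (ch === "&") {
      ch = "@";
    }
    output += ch;
  }
  return output;
}

// Code units: 0061 0026 D800 DC00 0062
const a = "a&\uD800\uDC00b";

// Index 3 falls between D800 and DC00, so the intermediate string
// contains unpaired surrogates; they are recombined on removal.
const roundTrip = insertThenRemove(a, "x", 3);

// The surrogate pair passes through the loop untouched.
const replaced = replaceAmpersand(a);

// Case conversion has no mapping for a lone surrogate and leaves it alone.
const lowered = "A\uD800B".toLowerCase();

console.log(roundTrip === a);
console.log(replaced === "a@\uD800\uDC00b");
console.log(lowered === "a\uD800b");
```

None of these operations needs to know whether a code unit is part of a pair, which is the sense in which their domain and codomain are code unit sequences rather than code points.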

