Re: Full Unicode based on UTF-16 proposal

Glenn Adams Tue, 27 Mar 2012 08:03:07 -0700

On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:


> That, as Norbert explained, is not the intention of the standard. Take a
> look at the discussion of "Unicode 16-bit string" in chapter 3. The
> committee recognized that fragments may be formed when working with UTF-16,
> and that destructive changes may do more harm than good.
>
> x = a.substring(0, 5) + b + a.substring(5, a.length());
> y = x.substring(0, 5) + x.substring(6, x.length());
>
> After this operation is done, you want y == a, even if 5 is between D800
> and DC00.
>

Assuming that b.length() == 1 in this example, my interpretation is
that '=', '+', and 'substring' are operations whose domain and codomain
are (currently defined) ES Strings, namely sequences of UTF-16 code units.
Since none of these operations entails interpreting the semantics of a code
point (i.e., interpreting abstract characters), there is no violation
of C1 here.
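
As a minimal sketch of that reading (TypeScript, whose strings are ES Strings; the helper names and sample values are illustrative and not taken from the proposal), the splice and its inverse round-trip even when the cut falls inside a surrogate pair:

```typescript
// Hypothetical helpers; both operate purely on UTF-16 code units.
function insertAt(a: string, index: number, b: string): string {
  return a.substring(0, index) + b + a.substring(index, a.length);
}

function removeAt(x: string, index: number, count: number): string {
  return x.substring(0, index) + x.substring(index + count, x.length);
}

// "a", U+10000 (as the surrogate pair D800 DC00), "b": four code units.
const a = "a\uD800\uDC00b";
const b = "-"; // b.length === 1, as assumed above

// Index 2 falls between D800 and DC00, so x briefly holds two unpaired surrogates.
const x = insertAt(a, 2, b);
const y = removeAt(x, 2, b.length);

console.log(y === a); // true
```

Nothing in either helper asks what a code unit means, which is why the intermediate string x can be ill-formed without any loss of data.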

> Or take:
> output = "";
> for (int i = 0; i < s.length(); ++i) {
>   ch = s.charAt(i);
>   if (ch.equals('&')) {
>     ch = '@';
>   }
>   output += ch;
> }
>
> After this operation is done, you want "a&\u{10000}b" to become
> "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b".
> It is also an unnecessary burden on lower-level software to always check
> this stuff.
>

Again, in this example, I assume that the string literal "a&\u{10000}b"
maps to the UTF-16 code unit sequence:

0061 0026 D800 DC00 0062

Given that 'charAt(i)' is defined on (and is indexing) code units and not
code points, and since the 'equals' operator is also defined on code units,
this example also does not require interpreting the semantics of code
points (i.e., interpreting abstract characters).
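
A sketch of the same loop over code units (TypeScript; the function names are hypothetical) suggests why the surrogate pair passes through untouched: each code unit is only compared and copied, never interpreted.

```typescript
// Replace every '&' code unit with '@', copying all other code units as-is.
function replaceAmpersand(s: string): string {
  let output = "";
  for (let i = 0; i < s.length; ++i) {
    let ch = s.charAt(i); // a one-code-unit string, possibly a lone surrogate
    if (ch === "&") {
      ch = "@";
    }
    output += ch;
  }
  return output;
}

// Render a string as space-separated, four-digit hex code units.
function codeUnitsHex(s: string): string {
  const units: string[] = [];
  for (let i = 0; i < s.length; ++i) {
    const hex = s.charCodeAt(i).toString(16).toUpperCase();
    units.push(("0000" + hex).slice(-4));
  }
  return units.join(" ");
}

const sample = "a&\uD800\uDC00b"; // same code units as "a&\u{10000}b"
console.log(codeUnitsHex(sample));                   // 0061 0026 D800 DC00 0062
console.log(codeUnitsHex(replaceAmpersand(sample))); // 0061 0040 D800 DC00 0062
```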

However, in Norbert's questions above about isUUppercase(int) and
toUpperCase(int), it is clear that the domain of these operations is code
points, not code units, and further, that they require interpretation as
abstract characters in order to determine the semantics of the
corresponding characters.

My conclusion is that the determination of whether C1 is violated or not
depends upon the domain, codomain, and operation being considered.
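
To make the distinction concrete, here is a rough sketch (TypeScript; `decodeUtf16` and `LoneSurrogatePolicy` are hypothetical names, not part of any proposal) of a conversion from code units to code points. This is the step at which a policy for unpaired surrogates has to be chosen: pass them through as code points with the same value, map them to U+FFFD, or raise an exception.

```typescript
type LoneSurrogatePolicy = "passthrough" | "replace" | "throw";

function decodeUtf16(s: string, policy: LoneSurrogatePolicy): number[] {
  const codePoints: number[] = [];
  for (let i = 0; i < s.length; ++i) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: combine into one supplementary code point.
        codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        ++i;
        continue;
      }
    }
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      // Unpaired surrogate code unit: apply the chosen policy.
      if (policy === "throw") {
        throw new RangeError("unpaired surrogate at index " + i);
      }
      codePoints.push(policy === "replace" ? 0xFFFD : unit);
      continue;
    }
    codePoints.push(unit);
  }
  return codePoints;
}

const paired = "a\uD800\uDC00b";
const unpaired = "a\uD800b";

console.log(decodeUtf16(paired, "passthrough"));   // 97, 65536, 98
console.log(decodeUtf16(unpaired, "passthrough")); // 97, 55296 (0xD800), 98
console.log(decodeUtf16(unpaired, "replace"));     // 97, 65533 (0xFFFD), 98
```

Only the output of a function like this feeds code-point operations such as case mapping, so this is where the C1 question actually arises; the code-unit operations in the earlier examples never reach it.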


> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
> output, then you do need to either convert to FFFD or take some other
> action.
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
> *— Il meglio è l’inimico del bene —*
>
>
>
> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>
>>
>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
>> [email protected]> wrote:
>>
>>> The conformance clause doesn't say anything about the interpretation of
>>> (UTF-16) code units as code points. To check conformance with C1, you have
>>> to look at how the resulting code points are actually further interpreted.
>>>
>>
>> True, but if the proposed language
>>
>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a
>> surrogate pair, is interpreted as a code point with the same value."
>>
>> is adopted, then will not this have an effect of creating unpaired
>> surrogates as code points? If so, then by my estimation, this *will* increase
>> the likelihood of their being interpreted as abstract characters... e.g.,
>> if the unpaired code unit is interpreted as an unpaired surrogate code
>> point, and some process/function performs *any* predicate or transform
>> on that code point, then that amounts to interpreting it as an abstract
>> character.
>>
>> I would rather see such an unpaired code unit either (1) be mapped to
>> U+FFFD, or (2) cause an exception to be raised when performing an
>> operation that requires conversion of the UTF-16 code unit sequence.
>>
>>
>>> My proposal interprets the resulting code points in the following ways:
>>>
>>> 1) In regular expressions, they can be used in both patterns and input
>>> strings to be matched. They may be compared against other code points, or
>>> against character classes, some of which will hopefully soon be defined by
>>> Unicode properties. In the case of comparing against other code points,
>>> they can't match any code points assigned to abstract characters. In the
>>> case of Unicode properties, they'll typically fall into the large bucket of
>>> have-nots, along with other unassigned code points or, for example, U+FFFD,
>>> unless you ask for their general category.
>>>
>>> 2) When parsing identifiers, they will not have the ID_Start or
>>> ID_Continue properties, so they'll be excluded, just like other unassigned
>>> code points or U+FFFD.
>>>
>>> 3) In case conversion, they won't have upper case or lower case
>>> equivalents defined, and remain as is, as would happen for unassigned code
>>> points or U+FFFD.
>>>
>>> I don't think any of these amounts to interpretation as abstract
>>> characters. I mention U+FFFD because the alternative interpretation of
>>> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
>>> seem to improve anything.
>>>
>>> Norbert
>>>
>>>
>>>
>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>
>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>>> > <[email protected]> wrote:
>>> > I really like the direction you're going in, but have one minor
>>> > concern relating to regular expressions.
>>> >
>>> > In your proposal, you currently state:
>>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is not
>>> >        part of a surrogate pair, is interpreted as a code point with
>>> >        the same value."
>>> >
>>> > Just as a reminder, this would be in explicit violation of the Unicode
>>> > conformance clause C1 unless it can be guaranteed that such a code point
>>> > will not be interpreted as an abstract character:
>>> >
>>> > C1    A process shall not interpret a high-surrogate code point or a
>>> >       low-surrogate code point as an abstract character.
>>> >
>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>> >
>>> > Given that such guarantee is likely impractical, this presents a
>>> > problem for the above proposed language.
>>>
>>>
>>
>> _______________________________________________
>> es-discuss mailing list
>> [email protected]
>> https://mail.mozilla.org/listinfo/es-discuss
>>
>>
>

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
