Re: Full Unicode based on UTF-16 proposal

Glenn Adams Tue, 27 Mar 2012 12:08:03 -0700

So, if as a result of a policy of converting any UTF-16 code unit sequence
to a code point sequence one ends up with an unpaired surrogate, e.g.,
"\u{00DC00}", then performing a predicate on that code point, such as
described in D21 (e.g., IsAlphabetic) would entail interpreting it as an
abstract character?


I can see that D20 defines code point properties whose evaluation would not
entail interpreting the code point as an abstract character, e.g.,
IsSurrogate, IsNonCharacter, but where does one draw the line?
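
To make the question concrete, here is a minimal sketch in JavaScript. It
assumes a modern engine: String.prototype.codePointAt, String.fromCodePoint
and \p{...} property escapes all postdate this thread. The helper names
isSurrogate and isAlphabetic are hypothetical, chosen only to mirror the two
kinds of predicate mentioned above.

```javascript
// Hypothetical helper: a pure range test on the code point value.
// No character properties are consulted.
function isSurrogate(cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// Hypothetical helper: a character property lookup in the spirit of
// IsAlphabetic, done with a Unicode property escape.
function isAlphabetic(cp) {
  return /^\p{Alphabetic}$/u.test(String.fromCodePoint(cp));
}

const lone = "\uDC00";            // an unpaired low surrogate code unit
const cp = lone.codePointAt(0);   // read back as a code point with the same value

console.log(cp.toString(16));              // code point value in hex
console.log(isSurrogate(cp));              // range test
console.log(isAlphabetic(cp));             // property lookup
console.log(lone.toLowerCase() === lone);  // is case mapping a no-op here?
```

The range test never looks at character semantics, whereas the property
lookup does consult the character database. Whether that second kind of
operation counts as interpreting U+DC00 as an abstract character is the line
I am asking about.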

On Tue, Mar 27, 2012 at 11:15 AM, Mark Davis ☕ <[email protected]> wrote:

> The point of C1 is that you can't interpret the surrogate code point
> U+DC00 as a *character*, like an "a".
>
> Neither can you interpret the reserved code point U+0378 as a *character*,
> like a "b".
>
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
> — Il meglio è l’inimico del bene —
>
>
>
> On Tue, Mar 27, 2012 at 08:56, Glenn Adams <[email protected]> wrote:
>
>> This raises the question of what the point of C1 is.
>>
>>
>> On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
>>
>>> That would not be practical, nor predictable. And note that the 700K
>>> reserved code points are also not to be interpreted as characters; by your
>>> logic all of them would need to be converted to FFFD.
>>>
>>> And in practice, an unpaired surrogate is best treated just like a
>>> reserved (unassigned) code point. For example, a lowercase operation should
>>> convert characters with lowercase correspondents to those correspondents,
>>> and leave *everything* else alone: control characters, format characters,
>>> reserved code points, surrogates, etc.
>>>
>>> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>>>
>>>>> That, as Norbert explained, is not the intention of the standard. Take
>>>>> a look at the discussion of "Unicode 16-bit string" in chapter 3. The
>>>>> committee recognized that fragments may be formed when working with
>>>>> UTF-16, and that destructive changes may do more harm than good.
>>>>>
>>>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>>>
>>>>> After this operation is done, you want y == a, even if 5 is between
>>>>> D800 and DC00.
>>>>>
>>>>
>>>> Assuming that b.length() == 1 in this example, my interpretation of
>>>> this is that '=', '+', and 'substring' are operations whose domain and
>>>> co-domain are (currently defined) ES Strings, namely sequences of UTF-16
>>>> code units. Since none of these operations entail interpreting the
>>>> semantics of a code point (i.e., interpreting abstract characters), then
>>>> there is no violation of C1 here.
>>>>
>>>>> Or take:
>>>>> output = "";
>>>>> for (int i = 0; i < s.length(); ++i) {
>>>>>   ch = s.charAt(i);
>>>>>   if (ch.equals('&')) {
>>>>>     ch = '@';
>>>>>   }
>>>>>   output += ch;
>>>>> }
>>>>>
>>>>> After this operation is done, you want "a&\u{10000}b" to become
>>>>> "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b".
>>>>> It is also an unnecessary burden on lower-level software to always
>>>>> check this stuff.
>>>>>
>>>>
>>>> Again, in this example, I assume that the string literal "a&\u{10000}b"
>>>> maps to the UTF-16 code unit sequence:
>>>>
>>>> 0061 0026 D800 DC00 0062
>>>>
>>>> Given that 'charAt(i)' is defined on (and is indexing) code units and
>>>> not code points, and since the 'equals' operator is also defined on code
>>>> units, this example also does not require interpreting the semantics of
>>>> code points (i.e., interpreting abstract characters).
>>>>
>>>> However, in Norbert's questions above about isUUppercase(int) and
>>>> toUpperCase(int), it is clear that the domain of these operations is code
>>>> points, not code units, and further, that they require interpretation as
>>>> abstract characters in order to determine the semantics of the
>>>> corresponding characters.
>>>>
>>>> My conclusion is that the determination of whether C1 is violated or
>>>> not depends upon the domain, codomain, and operation being considered.
>>>>
>>>>
>>>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
>>>>> output, then you do need to either convert to FFFD or take some other
>>>>> action.
>>>>>
>>>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> The conformance clause doesn't say anything about the interpretation
>>>>>>> of (UTF-16) code units as code points. To check conformance with C1, you
>>>>>>> have to look at how the resulting code points are actually further
>>>>>>> interpreted.
>>>>>>>
>>>>>>
>>>>>> True, but if the proposed language
>>>>>>
>>>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part
>>>>>> of a surrogate pair, is interpreted as a code point with the same value."
>>>>>>
>>>>>> is adopted, then will this not have the effect of creating unpaired
>>>>>> surrogates as code points? If so, then by my estimation, this *will*
>>>>>> increase the likelihood of their being interpreted as abstract
>>>>>> characters... e.g., if the unpaired code unit is interpreted as an
>>>>>> unpaired surrogate code point, and some process/function performs *any*
>>>>>> predicate or transform on that code point, then that amounts to
>>>>>> interpreting it as an abstract character.
>>>>>>
>>>>>> I would rather see either (1) such an unpaired code unit mapped to
>>>>>> U+00FFFD, or (2) an exception raised when performing an operation that
>>>>>> requires conversion of the UTF-16 code unit sequence.
>>>>>>
>>>>>>
>>>>>>> My proposal interprets the resulting code points in the following
>>>>>>> ways:
>>>>>>>
>>>>>>> 1) In regular expressions, they can be used in both patterns and
>>>>>>> input strings to be matched. They may be compared against other code
>>>>>>> points, or against character classes, some of which will hopefully soon
>>>>>>> be defined by Unicode properties. In the case of comparing against other
>>>>>>> code points, they can't match any code points assigned to abstract
>>>>>>> characters. In the case of Unicode properties, they'll typically fall
>>>>>>> into the large bucket of have-nots, along with other unassigned code
>>>>>>> points or, for example, U+FFFD, unless you ask for their general
>>>>>>> category.
>>>>>>>
>>>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>>>> ID_Continue properties, so they'll be excluded, just like other
>>>>>>> unassigned code points or U+FFFD.
>>>>>>>
>>>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>>>> equivalents defined, and remain as is, as would happen for unassigned
>>>>>>> code points or U+FFFD.
>>>>>>>
>>>>>>> I don't think any of these amounts to interpretation as abstract
>>>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>>>> unpaired surrogates would be to replace them with U+FFFD, but that
>>>>>>> doesn't seem to improve anything.
>>>>>>>
>>>>>>> Norbert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>>>
>>>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>>>>>>> > <[email protected]> wrote:
>>>>>>> > I really like the direction you're going in, but have one minor
>>>>>>> > concern relating to regular expressions.
>>>>>>> >
>>>>>>> > In your proposal, you currently state:
>>>>>>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is
>>>>>>> >        not part of a surrogate pair, is interpreted as a code point
>>>>>>> >        with the same value."
>>>>>>> >
>>>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such
>>>>>>> > a code point will not be interpreted as an abstract character:
>>>>>>> >
>>>>>>> > C1    A process shall not interpret a high-surrogate code point or
>>>>>>> > a low-surrogate code point as an abstract character.
>>>>>>> >
>>>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>>>> >
>>>>>>> > Given that such guarantee is likely impractical, this presents a
>>>>>>> > problem for the above proposed language.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> es-discuss mailing list
>>>>>> [email protected]
>>>>>> https://mail.mozilla.org/listinfo/es-discuss
