Re: Full Unicode based on UTF-16 proposal

Mark Davis ☕ Tue, 27 Mar 2012 10:15:51 -0700

The point of C1 is that you can't interpret the surrogate code point U+DC00
as a *character*, like an "a".


Neither can you interpret the reserved code point U+0378 as a *character*,
like a "b".
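
One way to observe the parallel between U+DC00 and U+0378 is through case mapping in a present-day JavaScript engine. The following is a minimal sketch, not part of the proposal; `caseMappingLeavesAlone` is a hypothetical helper name, and it assumes U+0378 is still a reserved code point in the engine's Unicode data.

```javascript
// Case mapping applied to code points that are not assigned characters.
const loneSurrogate = "\uDC00"; // unpaired low surrogate
const reserved = "\u0378";      // assumed to be reserved (unassigned) in the engine's Unicode data

// Hypothetical helper: true when neither case mapping changes the string.
function caseMappingLeavesAlone(s) {
  return s.toLowerCase() === s && s.toUpperCase() === s;
}

console.log(caseMappingLeavesAlone(loneSurrogate)); // true
console.log(caseMappingLeavesAlone(reserved));      // true
console.log(caseMappingLeavesAlone("b"));           // false: "b" has an uppercase mapping
```

Neither code point is treated as a character with case behavior; both simply pass through.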

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*



On Tue, Mar 27, 2012 at 08:56, Glenn Adams <[email protected]> wrote:

> This raises the question of what the point of C1 is.
>
>
> On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
>
>> That would not be practical, nor predictable. And note that the 700K
>> reserved code points are also not to be interpreted as characters; by your
>> logic all of them would need to be converted to FFFD.
>>
>> And in practice, an unpaired surrogate is best treated just like a
>> reserved (unassigned) code point. For example, a lowercase operation should
>> convert characters with lowercase correspondents to those correspondents,
>> and leave *everything* else alone: control characters, format characters,
>> reserved code points, surrogates, etc.
>>
>> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>>
>>>
>>>
>>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>>
>>>> That, as Norbert explained, is not the intention of the standard. Take
>>>> a look at the discussion of "Unicode 16-bit string" in chapter 3. The
>>>> committee recognized that fragments may be formed when working with UTF-16,
>>>> and that destructive changes may do more harm than good.
>>>>
>>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>>
>>>> After this operation is done, you want y == a, even if 5 is between
>>>> D800 and DC00.
>>>>
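
The round trip above can be checked directly in JavaScript. This is a sketch under stated assumptions: `b` is a single code unit, index 5 falls between D800 and DC00 of U+10000, and `replaceLoneSurrogates` is a hypothetical helper standing in for a "destructive" policy that rewrites fragments to U+FFFD.

```javascript
const a = "abcd\u{10000}ef"; // code units: a b c d D800 DC00 e f
const b = "X";               // assumed to be exactly one code unit

// Splitting at index 5 separates D800 from DC00.
const x = a.substring(0, 5) + b + a.substring(5, a.length);
const y = x.substring(0, 5) + x.substring(6, x.length);
console.log(y === a); // true: the pair is reassembled intact

// Hypothetical destructive policy: every unpaired surrogate becomes U+FFFD.
function replaceLoneSurrogates(s) {
  const lone = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;
  return s.replace(lone, "\uFFFD");
}

const x2 = replaceLoneSurrogates(a.substring(0, 5)) + b +
           replaceLoneSurrogates(a.substring(5, a.length));
const y2 = x2.substring(0, 5) + x2.substring(6, x2.length);
console.log(y2 === a); // false: U+10000 has been lost
```
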
>>>
>>> Assuming that b.length() == 1 in this example, my interpretation of this
>>> is that '=', '+', and 'substring' are operations whose domain and co-domain
>>> are (currently defined) ES Strings, namely sequences of UTF-16 code units.
>>> Since none of these operations entail interpreting the semantics of a code
>>> point (i.e., interpreting abstract characters), then there is no violation
>>> of C1 here.
>>>
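>>>> Or take:
>>>> String output = "";
>>>> for (int i = 0; i < s.length(); ++i) {
>>>>   char ch = s.charAt(i);  // one UTF-16 code unit, possibly half of a pair
>>>>   if (ch == '&') {
>>>>     ch = '@';
>>>>   }
>>>>   output += ch;           // surrogate halves are copied through unchanged
>>>> }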
>>>>
>>>> After this operation is done, you want "a&\u{10000}b" to become
>>>> "a@\u{10000}b", not "a@\u{FFFD}\u{FFFD}b".
>>>> It is also an unnecessary burden on lower-level software to always
>>>> check this stuff.
>>>>
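
The same loop, transcribed to JavaScript as a rough sketch (`replaceAmpersand` is a hypothetical name). It walks the string one code unit at a time, so the two halves of U+10000 pass through separately and end up adjacent again in the output.

```javascript
// Replace '&' with '@' while iterating over UTF-16 code units.
function replaceAmpersand(s) {
  let output = "";
  for (let i = 0; i < s.length; ++i) {
    let ch = s.charAt(i); // a one-code-unit string, possibly a lone surrogate
    if (ch === "&") {
      ch = "@";
    }
    output += ch;
  }
  return output;
}

const result = replaceAmpersand("a&\u{10000}b");
console.log(result === "a@\u{10000}b"); // true
console.log(result.length);             // 5 code units: 0061 0040 D800 DC00 0062
```
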
>>>
>>> Again, in this example, I assume that the string literal "a&\u{10000}b"
>>> maps to the UTF-16 code unit sequence:
>>>
>>> 0061 0026 D800 DC00 0062
>>>
>>> Given that 'charAt(i)' is defined on (and is indexing) code units and
>>> not code points, and since the 'equals' operator is also defined on code
>>> units, this example also does not require interpreting the semantics of
>>> code points (i.e., interpreting abstract characters).
>>>
>>> However, in Norbert's questions above about isUUppercase(int) and
>>> toUpperCase(int), it is clear that the domain of these operations is code
>>> points, not code units, and further, that they require interpretation as
>>> abstract characters in order to determine the semantics of the
>>> corresponding characters.
>>>
>>> My conclusion is that the determination of whether C1 is violated or not
>>> depends upon the domain, codomain, and operation being considered.
>>>
>>>
>>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
>>>> output, then you do need to either convert to FFFD or take some other
>>>> action.
>>>>
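
Both options mentioned above for the storage/output boundary can be sketched with the standard `TextEncoder`, which replaces unpaired surrogates with U+FFFD when producing UTF-8. The names `encodeLenient` and `encodeStrict` are hypothetical, and throwing is only one possible reading of "some other action".

```javascript
// Matches a surrogate code unit that is not part of a valid pair.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Option 1: convert to U+FFFD on output (TextEncoder does this itself).
function encodeLenient(s) {
  return new TextEncoder().encode(s);
}

// Option 2: refuse to encode an ill-formed string.
function encodeStrict(s) {
  if (LONE_SURROGATE.test(s)) {
    throw new RangeError("unpaired surrogate in string");
  }
  return new TextEncoder().encode(s);
}

console.log(Array.from(encodeLenient("a\uD800b"))); // [97, 239, 191, 189, 98]; EF BF BD is U+FFFD
console.log(encodeStrict("a\u{10000}b").length);    // 6: one byte + four bytes + one byte
```
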
>>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>>
>>>>>
>>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> The conformance clause doesn't say anything about the interpretation
>>>>>> of (UTF-16) code units as code points. To check conformance with C1, you
>>>>>> have to look at how the resulting code points are actually further
>>>>>> interpreted.
>>>>>>
>>>>>
>>>>> True, but if the proposed language
>>>>>
>>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
>>>>> a surrogate pair, is interpreted as a code point with the same value."
>>>>>
>>>>> is adopted, then will this not have the effect of creating unpaired
>>>>> surrogates as code points? If so, then by my estimation, this *will*
>>>>> increase the likelihood of their being interpreted as abstract
>>>>> characters... e.g., if the unpaired code unit is interpreted as an
>>>>> unpaired surrogate code point, and some process/function performs *any*
>>>>> predicate or transform on that code point, then that amounts to
>>>>> interpreting it as an abstract character.
>>>>>
>>>>> I would rather see such an unpaired code unit either (1) be mapped to
>>>>> U+FFFD, or (2) cause an exception to be raised when performing an
>>>>> operation that requires conversion of the UTF-16 code unit sequence.
>>>>>
>>>>>
>>>>>> My proposal interprets the resulting code points in the following
>>>>>> ways:
>>>>>>
>>>>>> 1) In regular expressions, they can be used in both patterns and
>>>>>> input strings to be matched. They may be compared against other code
>>>>>> points, or against character classes, some of which will hopefully soon
>>>>>> be defined by Unicode properties. In the case of comparing against other
>>>>>> code points, they can't match any code points assigned to abstract
>>>>>> characters. In the case of Unicode properties, they'll typically fall
>>>>>> into the large bucket of have-nots, along with other unassigned code
>>>>>> points or, for example, U+FFFD, unless you ask for their general category.
>>>>>>
>>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>>> ID_Continue properties, so they'll be excluded, just like other
>>>>>> unassigned code points or U+FFFD.
>>>>>>
>>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>>> equivalents defined, and remain as is, as would happen for unassigned
>>>>>> code points or U+FFFD.
>>>>>>
>>>>>> I don't think any of these amounts to interpretation as abstract
>>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>>> unpaired surrogates would be to replace them with U+FFFD, but that
>>>>>> doesn't seem to improve anything.
>>>>>>
>>>>>> Norbert
>>>>>>
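
The three interpretations listed above can be probed in engines that support the regular expression `u` flag and Unicode property escapes. Those features were standardized after this thread, so the following is an observation about present-day behavior and not the proposal's text; `observations` is an illustrative name.

```javascript
const lone = "\uD800"; // an unpaired high surrogate

const observations = {
  // Interpreted as a single code point with the same value as the code unit.
  isOneCodePoint: /^.$/u.test(lone),
  codePointValue: lone.codePointAt(0),
  // 1) In code point mode it does not match half of a real surrogate pair.
  matchesHalfOfPair: /\uD800/u.test("\u{10000}"),
  // 2) It has neither identifier property.
  isIdStart: /\p{ID_Start}/u.test(lone),
  isIdContinue: /\p{ID_Continue}/u.test(lone),
  // 3) Case conversion leaves it as is.
  caseMappingChangesIt: lone.toUpperCase() !== lone || lone.toLowerCase() !== lone,
  // The exception mentioned in 1): asking for its general category.
  isCategorySurrogate: /\p{General_Category=Surrogate}/u.test(lone),
};

console.log(observations);
```
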
>>>>>>
>>>>>>
>>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>>
>>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <[email protected]> wrote:
>>>>>> > I really like the direction you're going in, but have one minor
>>>>>> > concern relating to regular expressions.
>>>>>> >
>>>>>> > In your proposal, you currently state:
>>>>>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is
>>>>>> >        not part of a surrogate pair, is interpreted as a code point
>>>>>> >        with the same value."
>>>>>> >
>>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such a
>>>>>> > code point will not be interpreted as an abstract character:
>>>>>> >
>>>>>> > C1    A process shall not interpret a high-surrogate code point or
>>>>>> > a low-surrogate code point as an abstract character.
>>>>>> >
>>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>>> >
>>>>>> > Given that such a guarantee is likely impractical, this presents a
>>>>>> > problem for the above proposed language.
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> es-discuss mailing list
>>>>> [email protected]
>>>>> https://mail.mozilla.org/listinfo/es-discuss
>>>>>
>>>>>
>>>>
>>>
>>
>
