Re: Full Unicode based on UTF-16 proposal

Glenn Adams Tue, 27 Mar 2012 13:14:39 -0700

ok, i'll accept your position at this point and drop my comment; i suppose
it is true that if there are already unpaired surrogates in user data as
UTF-16, then having unpaired surrogates as code points is no worse;


however, it would be useful if there were an informative pointer from the
spec under consideration to a UTC-sanctioned list of operations that
constitute "interpreting as abstract characters" and that, if used on such
data, could violate C1; to this end, it would be useful if C1 itself
included a concrete example of such an operation
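
As a rough illustration of where that line seems to fall, here is a minimal
Java sketch (Java only because the snippets quoted below are Java-like). The
class and method names are hypothetical and come from neither the proposal
nor the Unicode Standard. It decodes a UTF-16 string containing an unpaired
surrogate into code points, then applies a code point property, a character
property, and a case mapping. Per the reply quoted below, none of these
treats U+DC00 as a character.

```java
import java.util.Arrays;

public class UnpairedSurrogateDemo {
    // Decode UTF-16 code units into code points. String.codePoints() passes an
    // unpaired surrogate through as a code point with the same value, which is
    // the behaviour the proposed language describes.
    static int[] toCodePoints(String utf16) {
        return utf16.codePoints().toArray();
    }

    // A code point property in the sense of D20: it asks only what kind of
    // code point this is, not what character it might stand for.
    static boolean isSurrogateCodePoint(int cp) {
        return Character.getType(cp) == Character.SURROGATE;
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("a\uDC00b");
        System.out.println(Arrays.toString(cps));                // [97, 56320, 98]
        int lone = cps[1];
        System.out.println(isSurrogateCodePoint(lone));          // true
        // Character property: the surrogate simply does not have it.
        System.out.println(Character.isAlphabetic(lone));        // false
        // Case mapping: no lowercase mapping exists, so the value is unchanged.
        System.out.println(Character.toLowerCase(lone) == lone); // true
    }
}
```

An operation that would cross the line is one that assigns the surrogate a
character meaning of its own, for example rendering it as if it were a
letter.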

On Tue, Mar 27, 2012 at 2:02 PM, Mark Davis ☕ <[email protected]> wrote:

> > performing a predicate on that code point, such as described in D21
> > (e.g., IsAlphabetic) would entail interpreting it as an abstract character?
> No.
>
> > but where does one draw the line?
> The line is already drawn by the Unicode Consortium, by consulting the
> Unicode Character Database properties. If you look at the data in the
> Unicode Character Database for any particular property, say Alphabetic,
> you'll find that surrogate code points are not included where the property
> is a true character property. There are a few special cases where reserved
> code points are provisionally given "anticipatory" character properties,
> such as in bidi ranges, simply because that makes implementations more
> forward compatible, but there aren't any cases where a "character" property
> applies to a surrogate code point (other than by returning "No", or "n/a",
> or some such).
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
> *— Il meglio è l’inimico del bene —*
>
> On Tue, Mar 27, 2012 at 12:07, Glenn Adams <[email protected]> wrote:
>
>> So, if as a result of a policy of converting any UTF-16 code unit
>> sequence to a code point sequence one ends up with an unpaired surrogate,
>> e.g., "\u{00DC00}", then performing a predicate on that code point, such as
>> described in D21 (e.g., IsAlphabetic) would entail interpreting it as an
>> abstract character?
>>
>> I can see that D20 defines code point properties which would not entail
>> interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter,
>> but where does one draw the line?
>>
>>
>> On Tue, Mar 27, 2012 at 11:15 AM, Mark Davis ☕ <[email protected]> wrote:
>>
>>> The point of C1 is that you can't interpret the surrogate code point
>>> U+DC00 as a *character*, like an "a".
>>>
>>> Neither can you interpret the reserved code point U+0378 as a
>>> *character*, like a "b".
>>>
>>>
>>> On Tue, Mar 27, 2012 at 08:56, Glenn Adams <[email protected]> wrote:
>>>
>>>> This begs the question of what is the point of C1.
>>>>
>>>>
>>>> On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <[email protected]> wrote:
>>>>
>>>>> That would not be practical, nor predictable. And note that the 700K
>>>>> reserved code points are also not to be interpreted as characters; by your
>>>>> logic all of them would need to be converted to FFFD.
>>>>>
>>>>> And in practice, an unpaired surrogate is best treated just like a
>>>>> reserved (unassigned) code point. For example, a lowercase operation
>>>>> should convert characters with lowercase correspondents to those
>>>>> correspondents, and leave *everything* else alone: control characters,
>>>>> format characters, reserved code points, surrogates, etc.
>>>>>
>>>>> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <[email protected]> wrote:
>>>>>>
>>>>>>> That, as Norbert explained, is not the intention of the standard.
>>>>>>> Take a look at the discussion of "Unicode 16-bit string" in chapter 3.
>>>>>>> The committee recognized that fragments may be formed when working with
>>>>>>> UTF-16, and that destructive changes may do more harm than good.
>>>>>>>
>>>>>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>>>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>>>>>
>>>>>>> After this operation is done, you want y == a, even if 5 is between
>>>>>>> D800 and DC00.
>>>>>>>
>>>>>>
>>>>>> Assuming that b.length() == 1 in this example, my interpretation of
>>>>>> this is that '=', '+', and 'substring' are operations whose domain and
>>>>>> co-domain are (currently defined) ES Strings, namely sequences of UTF-16
>>>>>> code units. Since none of these operations entail interpreting the
>>>>>> semantics of a code point (i.e., interpreting abstract characters), then
>>>>>> there is no violation of C1 here.
>>>>>>
>>>>>>> Or take:
>>>>>>> output = "";
>>>>>>> for (int i = 0; i < s.length(); ++i) {
>>>>>>>   ch = s.charAt(i);
>>>>>>>   if (ch.equals('&')) {
>>>>>>>     ch = '@';
>>>>>>>   }
>>>>>>>   output += ch;
>>>>>>> }
>>>>>>>
>>>>>>> After this operation is done, you want "a&\u{10000}b" to become
>>>>>>> "a@\u{10000}b", not "a@\u{FFFD}\u{FFFD}b".
>>>>>>> It is also an unnecessary burden on lower-level software to always
>>>>>>> check this stuff.
>>>>>>>
>>>>>>
>>>>>> Again, in this example, I assume that the string literal
>>>>>> "a&\u{10000}b" maps to the UTF-16 code unit sequence:
>>>>>>
>>>>>> 0061 0026 D800 DC00 0062
>>>>>>
>>>>>> Given that 'charAt(i)' is defined on (and is indexing) code units and
>>>>>> not code points, and since the 'equals' operator is also defined on code
>>>>>> units, this example also does not require interpreting the semantics of
>>>>>> code points (i.e., interpreting abstract characters).
>>>>>>
>>>>>> However, in Norbert's questions above about isUpperCase(int) and
>>>>>> toUpperCase(int), it is clear that the domain of these operations is
>>>>>> code points, not code units, and further, that they require
>>>>>> interpretation as abstract characters in order to determine the
>>>>>> semantics of the corresponding characters.
>>>>>>
>>>>>> My conclusion is that the determination of whether C1 is violated or
>>>>>> not depends upon the domain, codomain, and operation being considered.
>>>>>>
>>>>>>
>>>>>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage
>>>>>>> or output, then you do need to either convert to FFFD or take some other
>>>>>>> action.
>>>>>>>
>>>>>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> The conformance clause doesn't say anything about the
>>>>>>>>> interpretation of (UTF-16) code units as code points. To check
>>>>>>>>> conformance with C1, you have to look at how the resulting code
>>>>>>>>> points are actually further interpreted.
>>>>>>>>>
>>>>>>>>
>>>>>>>> True, but if the proposed language
>>>>>>>>
>>>>>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part
>>>>>>>> of a surrogate pair, is interpreted as a code point with the same
>>>>>>>> value."
>>>>>>>>
>>>>>>>> is adopted, then will not this have an effect of creating unpaired
>>>>>>>> surrogates as code points? If so, then by my estimation, this *will*
>>>>>>>> increase the likelihood of their being interpreted as abstract
>>>>>>>> characters... e.g., if the unpaired code unit is interpreted as an
>>>>>>>> unpaired surrogate code point, and some process/function performs
>>>>>>>> *any* predicate or transform on that code point, then that amounts to
>>>>>>>> interpreting it as an abstract character.
>>>>>>>>
>>>>>>>> I would rather see such an unpaired code unit either (1) be mapped to
>>>>>>>> U+00FFFD, or (2) trigger an exception when an operation requires
>>>>>>>> conversion of the UTF-16 code unit sequence.
>>>>>>>>
>>>>>>>>
>>>>>>>>> My proposal interprets the resulting code points in the following
>>>>>>>>> ways:
>>>>>>>>>
>>>>>>>>> 1) In regular expressions, they can be used in both patterns and
>>>>>>>>> input strings to be matched. They may be compared against other code
>>>>>>>>> points, or against character classes, some of which will hopefully
>>>>>>>>> soon be defined by Unicode properties. In the case of comparing
>>>>>>>>> against other code points, they can't match any code points assigned
>>>>>>>>> to abstract characters. In the case of Unicode properties, they'll
>>>>>>>>> typically fall into the large bucket of have-nots, along with other
>>>>>>>>> unassigned code points or, for example, U+FFFD, unless you ask for
>>>>>>>>> their general category.
>>>>>>>>>
>>>>>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>>>>>> ID_Continue properties, so they'll be excluded, just like other
>>>>>>>>> unassigned code points or U+FFFD.
>>>>>>>>>
>>>>>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>>>>>> equivalents defined, and remain as is, as would happen for unassigned
>>>>>>>>> code points or U+FFFD.
>>>>>>>>>
>>>>>>>>> I don't think any of these amounts to interpretation as abstract
>>>>>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>>>>>> unpaired surrogates would be to replace them with U+FFFD, but that
>>>>>>>>> doesn't seem to improve anything.
>>>>>>>>>
>>>>>>>>> Norbert
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>>>>>
>>>>>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>>>>>>>>> > <[email protected]> wrote:
>>>>>>>>> > I really like the direction you're going in, but have one minor
>>>>>>>>> > concern relating to regular expressions.
>>>>>>>>> >
>>>>>>>>> > In your proposal, you currently state:
>>>>>>>>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but
>>>>>>>>> >        is not part of a surrogate pair, is interpreted as a code
>>>>>>>>> >        point with the same value."
>>>>>>>>> >
>>>>>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such
>>>>>>>>> > a code point will not be interpreted as an abstract character:
>>>>>>>>> >
>>>>>>>>> > C1    A process shall not interpret a high-surrogate code point
>>>>>>>>> > or a low-surrogate code point as an abstract character.
>>>>>>>>> >
>>>>>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>>>>>> >
>>>>>>>>> > Given that such a guarantee is likely impractical, this presents a
>>>>>>>>> > problem for the above proposed language.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> es-discuss mailing list
>>>>>>>> [email protected]
>>>>>>>> https://mail.mozilla.org/listinfo/es-discuss
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
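
The two halves of the argument quoted above can be put side by side:
splitting a string between the two code units of a surrogate pair should be
lossless while the data stays in memory, and replacement with U+FFFD belongs
at the storage or output boundary. The following is a minimal Java sketch
under those assumptions; the helper names `insertThenRemove` and
`replaceUnpairedSurrogates` are hypothetical and not taken from any
specification or library.

```java
public class SurrogateRoundTrip {
    // Insert b at a code unit index and remove it again. Nothing is sanitized
    // in between, so the intermediate string may hold unpaired surrogates.
    static String insertThenRemove(String a, String b, int index) {
        String x = a.substring(0, index) + b + a.substring(index);
        return x.substring(0, index) + x.substring(index + b.length());
    }

    // Replace each unpaired surrogate with U+FFFD. Intended only for the
    // point where the string leaves the program (storage or output).
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // a lone surrogate is returned as itself
            boolean lone = cp >= 0xD800 && cp <= 0xDFFF;
            out.appendCodePoint(lone ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "ab\uD800\uDC00cd";          // U+10000 occupies code units 2 and 3
        String y = insertThenRemove(a, "&", 3); // index 3 falls between D800 and DC00
        System.out.println(y.equals(a));        // true

        String fragment = a.substring(0, 3);    // ends with an unpaired high surrogate
        System.out.println(replaceUnpairedSurrogates(fragment).equals("ab\uFFFD")); // true
        System.out.println(replaceUnpairedSurrogates(a).equals(a));                 // true
    }
}
```

Had the intermediate string been sanitized eagerly, the round trip in
`insertThenRemove` would have produced two U+FFFD code units in place of
U+10000.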
