Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Sat, 24 Mar 2012 17:49:26 -0700

Thanks for the detailed comments! Replies below.

Norbert



On Mar 23, 2012, at 9:46 , Phillips, Addison wrote:

> Comments follow.
> 
> 1. Definition of string. You say:
> 
> --
> However, ECMAScript does not place any restrictions or requirements on the
> sequence of code units in a String value, so it may be ill-formed when
> interpreted as a UTF-16 code unit sequence.
> --
> 
> I know what you mean, but others might not. Perhaps:
> 
> --
> However, ECMAScript does not place any restrictions or requirements on the 
> sequence of code units in a String value, so the sequence of code units may 
> contain code units that are not valid in Unicode or sequences that do not 
> represent Unicode code points (such as unpaired surrogates).
> --

I can add a note that ill-formed here means containing unpaired surrogates. If 
I read chapter 3 of the Unicode Standard correctly, there's no other way for 
UTF-16 to be ill-formed. UTF-16 code units by themselves cannot be invalid - 
any 16-bit value can occur in a well-formed UTF-16 string.
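
A minimal sketch of that reading of "ill-formed", assuming the only defect 
to look for is an unpaired surrogate. The name isWellFormedUtf16 is 
hypothetical and not part of the proposal:

```typescript
// Returns true if every surrogate code unit in `s` belongs to a
// high-surrogate + low-surrogate pair, i.e. the string is well-formed UTF-16.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        return false;
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}
```

Under this check a string containing 0xFFFE or 0xFFFF still counts as 
well-formed, since those values are noncharacters rather than surrogates.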

> 2. In this section, I would define string after code unit and code point. I 
> would also include a definition of surrogates/surrogate pairs.

Makes sense.

> 3. Under "text interpretation" you say:
> 
> --
> For compatibility with existing applications, it has to allow surrogate
> code points (code points between U+D800 and U+DFFF which can never
> represent characters).
> --
> 
> This would (see above) benefit from having a definition in place. As noted, 
> this is slightly incomplete, since surrogate code units are used to form 
> supplementary characters.

The text is about surrogate code points, not about surrogate code units.
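
To illustrate the distinction, here is a rough sketch of a code point 
interpretation of a code unit sequence. It assumes that a well-formed 
surrogate pair maps to one supplementary code point and that an unpaired 
surrogate code unit maps to the surrogate code point of the same value. The 
name toCodePoints is hypothetical:

```typescript
// Interprets a sequence of UTF-16 code units as a sequence of code points.
// Paired surrogates combine into a supplementary code point; an unpaired
// surrogate is passed through as a surrogate code point (U+D800..U+DFFF).
function toCodePoints(s: string): number[] {
  const result: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Surrogate pair: combine into one supplementary code point.
        result.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the low surrogate as well
        continue;
      }
    }
    // BMP code unit or unpaired surrogate: the code point equals the unit.
    result.push(unit);
  }
  return result;
}
```

With this sketch, the two code units 0xD83D 0xDE00 yield the single code 
point 0x1F600, while 0xD83D alone yields the surrogate code point 0xD83D.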

> 4. 0xFFFE and 0xFFFF are non-characters in Unicode. I do think you do the 
> right thing here. It's just a nit that you never note this ;-).
> 
> 5. Editorial unnecessary ;-):
> 
> --
> This transformation is rather ugly, but I’m afraid it’s the price ECMAScript
> has to pay for being 12 years late in supporting supplementary characters.
> --
> 
> 6. Under 'details' you suggest a number of renamings. Are these strictly 
> necessary? The term 'character' could be taken to mean 'code point' instead, 
> with an explanatory note.

Unfortunately, the term "character" is poisoned in ES5 by a redefinition as 
"code unit" (chapter 6). For ES6, I'd like the spec to be really clear where it 
means code units and where it means code points. Maybe we can then reintroduce 
"character" in ES7...

> 7. Skipping down a lot, to "section 6 source text", you propose:
> 
> --
> The text is expected to have been normalised to Unicode Normalization
> Form C (Canonical Decomposition, followed by Canonical Composition), as
> described in Unicode Standard Annex #15.
> --
> 
> I think this should be removed or modified.

This sentence is essentially copied from ES5 (with corrected references), and 
as I copied it, I made a note to myself that we need to discuss normalization, 
just not as part of this proposal...

> Automatic application of NFC is not always desirable, as it can affect 
> presentation or processing. Perhaps:
> 
> --
> Normalization of the text to Unicode Normalization Form C (Canonical 
> Decomposition, followed by Canonical Composition), as described in Unicode 
> Standard Annex #15, is recommended when transcoding from another character 
> encoding.
> --
> 
> 8. In "7.6 Identifier Names and Identifiers" you don't actually forbid 
> unpaired surrogates or non-characters in the text (Identifier_Part:: does 
> this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as 
> the last character in an identifier.

I can add a note about surrogate code points and non-characters, but, as you 
say, they are already ruled out because they can't have the required Unicode 
properties ID_Start or ID_Continue.

The use of ZWJ and ZWNJ is unchanged from ES5. UAX 31 has much stricter rules 
on where they would be allowed, but I'm not sure we have a strong case for 
changing the rules in ECMAScript.
http://www.unicode.org/reports/tr31/tr31-9.html#Layout_and_Format_Control_Characters

> 9. "15.5.4.6": you say "(a nonnegative integer less than 0x10FFFF)", whereas 
> it should say: "(a nonnegative integer less than or equal to 0x10FFFF)"

Will fix.
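
For reference, a small sketch of the inclusive bound, together with the 
UTF-16 encoding of a code point. The names isCodePoint and toUtf16CodeUnits 
are hypothetical and are not the names used in the proposal:

```typescript
const MAX_CODE_POINT = 0x10FFFF;

// A code point is a nonnegative integer less than or equal to 0x10FFFF.
function isCodePoint(n: number): boolean {
  return Math.floor(n) === n && n >= 0 && n <= MAX_CODE_POINT;
}

// Encodes one code point as one or two UTF-16 code units.
function toUtf16CodeUnits(cp: number): number[] {
  if (!isCodePoint(cp)) {
    throw new RangeError("Not a code point: " + cp);
  }
  if (cp < 0x10000) {
    return [cp]; // BMP code point: a single code unit
  }
  const offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}
```

With the inclusive bound, 0x10FFFF itself is accepted and encodes as the 
code units 0xDBFF 0xDFFF, whereas 0x110000 is rejected.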

> 10. In the section on "what about utf-32", you say: " and the code points 
> start at positions 1, 2, 3.". Of course this should be "... and the code 
> points start at positions 0, 1, 2."

Of course.
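
As a sketch of the indexing point, the function below lists the code unit 
positions at which code points start in a UTF-16 string. The sample string 
of three supplementary characters is an assumption and may not be the 
example used in the proposal, and the name codePointStarts is hypothetical:

```typescript
// Returns the code unit index at which each code point of `s` starts.
// An unpaired surrogate is treated as a code point of its own.
function codePointStarts(s: string): number[] {
  const starts: number[] = [];
  let i = 0;
  while (i < s.length) {
    starts.push(i);
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    const isPair = isHigh && next >= 0xDC00 && next <= 0xDFFF;
    i += isPair ? 2 : 1; // a surrogate pair occupies two code units
  }
  return starts;
}

// Three supplementary characters (U+1D400, U+1D401, U+1D402).
const sample = "\uD835\uDC00\uD835\uDC01\uD835\uDC02";
const utf16Starts = codePointStarts(sample); // [0, 2, 4]
```

In a UTF-32 representation the same three code points would start at 
positions 0, 1, 2, which is the corrected wording above.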

> Thanks for this proposal!
> 
> Addison
> 
>> -----Original Message-----
>> From: Norbert Lindenberg [mailto:[email protected]]
>> Sent: Thursday, March 22, 2012 10:14 PM
>> To: [email protected]
>> Subject: Re: Full Unicode based on UTF-16 proposal
>> 
>> I've updated the proposal based on the feedback received so far. Changes are
>> listed in the Updates section.
>> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
>> 
>> Norbert
>> 
>> 
>> On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote:
>> 
>>> Based on my prioritization of goals for support for full Unicode in
>>> ECMAScript [1], I've put together a proposal for supporting the full
>>> Unicode character set based on the existing representation of text in
>>> ECMAScript using UTF-16 code unit sequences:
>>> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>>> 
>>> The detailed proposed spec changes serve to get a good idea of the scope of
>>> the changes, but will need some polishing.
>>> 
>>> Comments?
>>> 
>>> Thanks,
>>> Norbert
>>> 
>>> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
>>> 
>> 
> 
