Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Mon, 26 Mar 2012 19:28:15 -0700

OK, I guess we have to have Unicode code point escapes :-)

I'd expect them to work in identifiers, string literals, and regular 
expressions (possibly with restrictions coming out of today's emails), but not 
in JSON source.
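
For later readers, a minimal sketch of that expectation. It assumes an engine implementing ES2015, where code point escapes were eventually standardized and where regular expressions need the u flag to accept them. None of this was settled at the time of this thread, and the variable names are purely illustrative.

```javascript
// String literal: one code point, stored as two UTF-16 code units.
const tetragram = "\u{1D307}";
console.log(tetragram.length);                        // 2
console.log(tetragram.codePointAt(0).toString(16));   // "1d307"

// Identifier: \u{61} is "a", so this declares the variable abc.
const \u{61}bc = 1;
console.log(abc);                                     // 1

// Regular expression: code point escapes are recognized with the u flag.
console.log(/^\u{1D307}$/u.test(tetragram));          // true

// JSON source: only the four-digit \uNNNN form is valid, so \u{...} is rejected.
let jsonRejected = false;
try {
  JSON.parse('"\\u{1D307}"');
} catch (e) {
  jsonRejected = e instanceof SyntaxError;
}
console.log(jsonRejected);                            // true
console.log(JSON.parse('"\\uD834\\uDF07"') === tetragram); // true
```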


Norbert


On Mar 26, 2012, at 4:45 , Roger Andrews wrote:

> The strawman is for source code characters, and says it has "no implications
> for string value encodings" (or RegExps).
> String & regexp literal escape sequences are explicitly defined in ES5 
> sections 7.8.4 & 7.8.5.
> Will Strawman style also work in ES6 string & regexp literals?  Thus making 
> regexp ranges much nicer (see final example below).
> 
> 
> Besides describing code points that have not yet been defined as
> characters, character escapes in string literals and regexps are useful because:
> 1)  control characters don't have glyphs at all,
> 2)  the various space glyphs are not readily distinguishable (same for some 
> dash/minus/line glyphs),
> 3)  breaking/non-breaking versions of characters are not distinguishable,
> 4)  many other glyphs are hard to distinguish (being tiny adjustments in
> positioning or form detail),
> 5)  some characters are "combining" -- which makes for a messy and confusing
> program if you use them raw.
> 
> If you use the raw non-ASCII characters in a program then you need some means 
> of creating them, preferably via a normal keyboard and in your favourite text 
> editor.
> All program readers need appropriate fonts installed to fully understand the 
> program, and program maintainers also need a Unicode-capable text editor 
> (potentially including non-BMP support).
> All links/stores that the program travels over or rests in must be
> Unicode-capable.
> Whereas using only ASCII chars to write a program is easy to do and always
> works no matter how basic your computing/transmission infrastructure. (ASCII 
> chars never get silently mangled in transmission or text editors.)
> 
> How to represent character escapes in a language.
> C/C++ has:
>   \xNN          8-bit char (U+0000 - U+00FF)
>   \uNNNN        16-bit char (U+0000 - U+FFFF)
>   \UNNNNNNNN    32-bit char (i.e. any 21-bit Unicode char)
> Strawman for source chars has:
>   \u{N...}               8 to 24-bit char (i.e. any 21-bit Unicode char)
> 
> 
> I'm struggling with how non-BMP escapes would be used in practice in strings
> & regexps -- especially regexp ranges.  Will Strawman style be used in string 
> & regexp literals?
> 
> Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").
> 
> To create the string "I like 𝌇" using escapes
> in C/C++ you can create a string:
>          "I like \U0001D307"
> if the Strawman style works in strings, in ES6 presumably you say:
>          "I like \u{1D307}"
> or do you have to know UTF-16 encoding rules and say:
>          "I like \uD834\uDF07"
> 
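
The "UTF-16 encoding rules" mentioned above come down to a little arithmetic. This is only a sketch: toSurrogatePair is a hypothetical helper, not part of any proposal in this thread, and the final comparison assumes an ES2015 engine that accepts \u{...} in string literals.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const high = 0xD800 + (offset >> 10);      // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D307);
console.log(high.toString(16), low.toString(16));            // d834 df07

// Both spellings denote the same string value.
console.log("I like \u{1D307}" === "I like \uD834\uDF07");   // true
```

The code point escape lets the engine do this computation, so the programmer never has to.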
> To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/
> should the programmer write:
> C/C++ style
>           /[\U0001D307-\U0001D356]/
> or will Strawman style work in regexps
>           /[\u{1D307}-\u{1D356}]/
> or in UTF-16 with {} grouping
>           /[{\uD834\uDF07}-{\uD834\uDF56}]/
> 
> Either C/C++ style or Strawman style escape is readable, natural, doesn't
> require knowledge of UTF-16 encoding rules, can be created easily with any 
> old keyboard, and won't upset text editors.
> 
> It's a bit unfriendly to require programmers to know UTF-16 rules just to
> put a non-BMP character in a string or regexp using an escape.  And in a
> regexp range it looks ugly and confusing.
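
A sketch of how the range question plays out, assuming an ES2015 engine in which the Strawman style escape is accepted in regular expressions carrying the u flag (an outcome that was still open in this thread). The variable names are illustrative.

```javascript
// Code point escapes as range limits; the u flag enables code point matching.
const tetragrams = /^[\u{1D307}-\u{1D356}]$/u;
console.log(tetragrams.test("\u{1D307}"));   // true
console.log(tetragrams.test("\u{1D356}"));   // true
console.log(tetragrams.test("\u{1D357}"));   // false

// Without the u flag the same class is read as UTF-16 code units:
// \uD834, then the range \uDF07-\uD834, then \uDF56.
// That range is out of order, so the pattern is rejected.
let legacyError = null;
try {
  new RegExp("[\u{1D307}-\u{1D356}]");
} catch (e) {
  legacyError = e;
}
console.log(legacyError instanceof SyntaxError);   // true
```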
> 
> 
> --------------------------------------------------
> From: "Norbert Lindenberg"
>> 
>> There is a strawman for code point escapes:
>> http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences
>> 
>> Note that for references to specific characters it's usually best to just
>> use the characters directly, as Dave did in
>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as
>> regular expressions where you might have to refer to range limits that
>> aren't actually assigned characters, or in test cases where you might use
>> characters for which your OS doesn't have glyphs yet.
>> 
>> Norbert
>> 
>> 
>> On Mar 25, 2012, at 2:57 , Roger Andrews wrote:
>> 
>>> Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
>>> literals?  The \U format expresses a full 32-bit code, which could be
>>> mapped internally to two 16-bit UTF-16 codes.
>>> 
>>> Then the programmer can describe exactly the required characters without
>>> caring about their coding in UTF-16 or whatever.
>>> 
>>> Could you use this to avoid complicated things in RegExps like
>>> [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like
>>> [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of
>>> interest?
>>> 
>>> The same goes for String literals, where the programmer does not really
>>> care about the encoding, just specifying the character.
>>> 
>>> (Sorry if I've missed something in the prior discussion.)
>>> 
>>> --------------------------------------------------
>>> From: "Norbert Lindenberg"
>>> To: "David Herman"
>>>> 
>>>> On Mar 24, 2012, at 12:21 , David Herman wrote:
>>>> 
>>>> [snip]
>>>> 
>>>>>> As for whether the switch to code-point-based matching should be
>>>>>> universal or require /u (an issue that your proposal leaves open),
>>>>>> IMHO it's better to require /u since it avoids the need for
>>>>>> transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}]
>>>>>> and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}],
>>>>>> and additionally avoids at least three potentially breaking changes
>>>>>> (two of which are explicitly mentioned in your proposal):
>>>>> 
>>>>> I haven't completely understood this part of the discussion. Looking at
>>>>> /u as a "little red switch" (LRS), i.e., an opportunity to make
>>>>> judicious breaks with compatibility, could we not allow character
>>>>> classes with unescaped non-BMP code points, e.g.:
>>>>> 
>>>>>  js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
>>>>>  ["𝌆𝌇𝌈𝌉𝌊"]
>>>>> 
>>>>> I'm still getting up to speed on Unicode and JS string semantics, so
>>>>> I'm guessing that I'm missing a reason why that wouldn't work...
>>>>> Presumably the JS source of the regexp literal, as a sequence of UTF-16
>>>>> code units, represents the tetragram code points as surrogate pairs.
>>>>> Can we not recognize surrogate pairs in character classes within a /u
>>>>> regexp and interpret them as code points?
>>>> 
>>>> With /u, that's exactly what happens. My first proposal was to make this
>>>> happen even without a new flag, i.e., make
>>>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
>>>> work based on code points, and Steve is arguing against that because of
>>>> compatibility risk. My proposal also includes some transformations to
>>>> keep existing regular expressions working, and Steve correctly observes
>>>> that if we have a flag for code point mode, then the transformation is
>>>> not needed - old regular expressions would continue to work in code unit
>>>> mode, while new regular expressions with /u get code point treatment.
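
The difference between the two modes can be sketched as follows, assuming an ES2015 engine where the u flag selects code point mode. The variable names are illustrative, and the string is built from escapes only so the example survives ASCII-only tooling.

```javascript
// The five tetragrams from the example above, written with escapes.
const text = "\u{1D306}\u{1D307}\u{1D308}\u{1D309}\u{1D30A}";
console.log(text.length);               // 10 (UTF-16 code units)
console.log(Array.from(text).length);   // 5  (code points)

// Code unit mode (no flag): "." matches a single code unit,
// so one non-BMP character does not match /^.$/.
console.log(/^.$/.test("\u{1D306}"));   // false

// Code point mode (u flag): "." matches the whole surrogate pair.
console.log(/^.$/u.test("\u{1D306}"));  // true

// The character class range is interpreted over code points.
const match = text.match(/[\u{1D306}-\u{1D356}]+/u);
console.log(match[0] === text);         // true
```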

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss