OK, I guess we have to have Unicode code point escapes :-)
I'd expect them to work in identifiers, string literals, and regular
expressions (possibly with restrictions coming out of today's emails), but not
in JSON source.
Norbert
On Mar 26, 2012, at 4:45 , Roger Andrews wrote:
> The strawman is for source code characters, and says it has "no implications
> for string value encodings" (or RegExps).
> String & regexp literal escape sequences are explicitly defined in ES5
> sections 7.8.4 & 7.8.5.
> Will Strawman style also work in ES6 string & regexp literals? That would
> make regexp ranges much nicer (see final example below).
>
>
> Besides describing code points that have not yet been assigned to
> characters, character escapes in string literals and regexps are useful because:
> 1) control characters don't have glyphs at all,
> 2) the various space glyphs are not readily distinguishable (same for some
> dash/minus/line glyphs),
> 3) breaking/non-breaking versions of characters are not distinguishable,
> 4) many other glyphs are hard to distinguish (being tiny adjustments in
> positioning or form detail),
> 5) some characters are "combining" -- which makes for a messy and confusing
> program if you use them raw.
>
> If you use the raw non-ASCII characters in a program then you need some means
> of creating them, preferably via a normal keyboard and in your favourite text
> editor.
> All program readers need appropriate fonts installed to fully understand the
> program, and program maintainers also need a Unicode-capable text editor
> (potentially including non-BMP support).
> All links/stores that the program travels over or rests in must be
> Unicode-capable.
> Whereas using only ASCII chars to write a program is easy to do and always
> works no matter how basic your computing/transmission infrastructure. (ASCII
> chars never get silently mangled in transmission or text editors.)
>
> How to represent character escapes in a language.
> C/C++ has:
>   \xNN        8-bit char (U+0000 - U+00FF)
>   \uNNNN      16-bit char (U+0000 - U+FFFF)
>   \UNNNNNNNN  32-bit char (i.e. any 21-bit Unicode char)
> Strawman for source chars has:
>   \u{N...}    8 to 24-bit char (i.e. any 21-bit Unicode char)
>
>
> I'm struggling with how non-BMP escapes would be used in practice in strings
> & regexps -- especially regexp ranges. Will Strawman style be used in string
> & regexp literals?
>
> Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").
>
> To create the string "I like 𝌇" using escapes
> in C/C++ you can create a string:
> "I like \U0001D307"
> if the Strawman style works in strings, in ES6 presumably you say:
> "I like \u{1D307}"
> or do you have to know UTF-16 encoding rules and say:
> "I like \uD834\uDF07"
>
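
For reference, the "UTF-16 encoding rules" in question amount to the arithmetic below. This is only a minimal sketch: toSurrogatePair is a hypothetical helper name, not something from the strawman or from ES5, and it handles supplementary code points only.

```javascript
// Hypothetical helper: split a supplementary code point
// (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const high = 0xD800 + (offset >> 10);      // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

// U+1D307 splits into 0xD834 and 0xDF07.
const [highSurrogate, lowSurrogate] = toSurrogatePair(0x1D307);
const viaPair = "I like " + String.fromCharCode(highSurrogate, lowSurrogate);
```

This is the calculation a programmer has to do by hand whenever only \uNNNN escapes are available.
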
> To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/
> should the programmer write:
> C/C++ style
> /[\U0001D307-\U0001D356]/
> or will Strawman style work in regexps
> /[\u{1D307}-\u{1D356}]/
> or in UTF-16 with {} grouping
> /[{\uD834\uDF07}-{\uD834\uDF56}]/
>
> Either C/C++ style or Strawman style escape is readable, natural, doesn't
> require knowledge of UTF-16 encoding rules, can be created easily with any
> old keyboard, and won't upset text editors.
>
> It's a bit unfriendly to require programmers to know UTF-16 rules just to
> put a non-BMP character in a string or regexp using an escape. And in a
> regexp range it looks ugly and confusing.
>
>
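
As a rough illustration of the Strawman style in both positions: the sketch below assumes an engine that implements ECMAScript 2015, which adopted the \u{...} form. In regexps that form is only recognized together with the u flag. The variable names are made up for the example.

```javascript
// Code point escape vs. explicit surrogate pair in a string literal.
const withCodePointEscape = "I like \u{1D307}";
const withSurrogates = "I like \uD834\uDF07";
const sameString = withCodePointEscape === withSurrogates; // true

// Range U+1D307..U+1D356 written with code point escapes; needs /u.
const tetragramRange = /[\u{1D307}-\u{1D356}]/u;
const matchesStart = tetragramRange.test("\u{1D307}");   // lower bound
const matchesInside = tetragramRange.test("\u{1D330}");  // within range
const matchesOutside = tetragramRange.test("\u{1D357}"); // just past the end
```

Neither literal requires the reader to know the surrogate values, which is the point Roger is making.
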
> --------------------------------------------------
> From: "Norbert Lindenberg"
>>
>> There is a strawman for code point escapes:
>> http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences
>>
>> Note that for references to specific characters it's usually best to just
>> use the characters directly, as Dave did in
>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as
>> regular expressions where you might have to refer to range limits that
>> aren't actually assigned characters, or in test cases where you might use
>> characters for which your OS doesn't have glyphs yet.
>>
>> Norbert
>>
>>
>> On Mar 25, 2012, at 2:57 , Roger Andrews wrote:
>>
>>> Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
>>> literals? The \U format expresses a full 32-bit code, which could be
>>> mapped internally to two 16-bit UTF-16 code units.
>>>
>>> Then the programmer can describe exactly the required characters without
>>> caring about their coding in UTF-16 or whatever.
>>>
>>> Could you use this to avoid complicated things in RegExps like
>>> [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like
>>> [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of
>>> interest?
>>>
>>> The same goes for String literals, where the programmer does not really
>>> care about the encoding, just specifying the character.
>>>
>>> (Sorry if I've missed something in the prior discussion.)
>>>
>>> --------------------------------------------------
>>> From: "Norbert Lindenberg"
>>> To: "David Herman"
>>>>
>>>> On Mar 24, 2012, at 12:21 , David Herman wrote:
>>>>
>>>> [snip]
>>>>
>>>>>> As for whether the switch to code-point-based matching should be
>>>>>> universal or require /u (an issue that your proposal leaves open),
>>>>>> IMHO it's better to require /u since it avoids the need for
>>>>>> transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}]
>>>>>> and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}],
>>>>>> and additionally avoids at least three potentially breaking changes
>>>>>> (two of which are explicitly mentioned in your proposal):
>>>>>
>>>>> I haven't completely understood this part of the discussion. Looking at
>>>>> /u as a "little red switch" (LRS), i.e., an opportunity to make
>>>>> judicious breaks with compatibility, could we not allow character
>>>>> classes with unescaped non-BMP code points, e.g.:
>>>>>
>>>>> js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
>>>>> ["𝌆𝌇𝌈𝌉𝌊"]
>>>>>
>>>>> I'm still getting up to speed on Unicode and JS string semantics, so
>>>>> I'm guessing that I'm missing a reason why that wouldn't work...
>>>>> Presumably the JS source of the regexp literal, as a sequence of UTF-16
>>>>> code units, represents the tetragram code points as surrogate pairs.
>>>>> Can we not recognize surrogate pairs in character classes within a /u
>>>>> regexp and interpret them as code points?
>>>>
>>>> With /u, that's exactly what happens. My first proposal was to make this
>>>> happen even without a new flag, i.e., make
>>>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
>>>> work based on code points, and Steve is arguing against that because of
>>>> compatibility risk. My proposal also includes some transformations to
>>>> keep existing regular expressions working, and Steve correctly observes
>>>> that if we have a flag for code point mode, then the transformation is
>>>> not needed - old regular expressions would continue to work in code unit
>>>> mode, while new regular expressions with /u get code point treatment.
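
The difference between the two modes can be sketched as follows, assuming an engine that implements the u flag as it was eventually standardized in ECMAScript 2015. The strings use \u{...} escapes for U+1D306 through U+1D30A so the example stays ASCII-only; the variable names are illustrative.

```javascript
const tetragrams = "\u{1D306}\u{1D307}\u{1D308}\u{1D309}\u{1D30A}";
const codeUnitCount = tetragrams.length; // 10: each character is two code units

// Code point mode: the class is one range from U+1D306 to U+1D356.
const codePointMatch = tetragrams.match(/[\u{1D306}-\u{1D356}]+/u);

// "." consumes one code unit without /u and one code point with it.
const dotCodeUnit = /^.$/.test("\u{1D306}");   // false
const dotCodePoint = /^.$/u.test("\u{1D306}"); // true

// Code unit mode: the same class is read as individual surrogates, so it
// contains the range \uDF06-\uD834, which is out of order.
let legacyRangeError = null;
try {
  new RegExp("[\u{1D306}-\u{1D356}]+");
} catch (e) {
  legacyRangeError = e;
}
```

Without the flag, existing patterns keep their code unit behaviour, so no transformation of old regular expressions is needed.
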