Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Sun, 25 Mar 2012 23:47:56 -0700

There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences


Note that for references to specific characters it's usually best to just use 
the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can 
be useful in cases such as regular expressions where you might have to refer to 
range limits that aren't actually assigned characters, or in test cases where 
you might use characters for which your OS doesn't have glyphs yet.
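
As a rough illustration of both options, here is a minimal sketch. It assumes an engine that implements the `u` flag and the `\u{...}` code point escapes as later standardized in ECMAScript 2015, which may differ in detail from the strawman linked above. The variable names are made up for the example.

```javascript
// Sketch only: assumes an engine that supports the ES2015 `u` flag and
// \u{...} code point escapes, which may differ from the strawman's syntax.

// A literal character and its code point escape denote the same string.
const literal = "𝌆";
const escaped = "\u{1D306}";
console.log(literal === escaped); // true

// Escapes as range limits: U+1D300 and U+1D35F delimit the Tai Xuan Jing
// Symbols block. A range limit does not have to be an assigned character.
const inBlock = /^[\u{1D300}-\u{1D35F}]+$/u;
console.log(inBlock.test("𝌆𝌇𝌈𝌉𝌊")); // true
console.log(inBlock.test("abc"));   // false
```

The escaped form is mostly useful when the range limit has no glyph to type, as with the block boundaries here.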

Norbert


On Mar 25, 2012, at 2:57 , Roger Andrews wrote:

> Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character 
> literals?  The \U format expresses a full 32-bit code, which could be mapped 
> internally to two 16-bit UTF-16 codes.
> 
> Then the programmer can describe exactly the required characters without 
> caring about their coding in UTF-16 or whatever.
> 
> Could you use this to avoid complicated things in RegExps like 
> [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like 
> [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of interest?
> 
> The same goes for String literals, where the programmer does not really care 
> about the encoding, just specifying the character.
> 
> (Sorry if I've missed something in the prior discussion.)
> 
> --------------------------------------------------
> From: "Norbert Lindenberg"
> To: "David Herman"
>> 
>> On Mar 24, 2012, at 12:21 , David Herman wrote:
>> 
>> [snip]
>> 
>>>> As for whether the switch to code-point-based matching should be universal 
>>>> or require /u (an issue that your proposal leaves open), IMHO it's better 
>>>> to require /u since it avoids the need for transforming 
>>>> \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and 
>>>> [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and 
>>>> additionally avoids at least three potentially breaking changes (two of 
>>>> which are explicitly mentioned in your proposal):
>>> 
>>> I haven't completely understood this part of the discussion. Looking at /u 
>>> as a "little red switch" (LRS), i.e., an opportunity to make judicious 
>>> breaks with compatibility, could we not allow character classes with 
>>> unescaped non-BMP code points, e.g.:
>>> 
>>>   js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
>>>   ["𝌆𝌇𝌈𝌉𝌊"]
>>> 
>>> I'm still getting up to speed on Unicode and JS string semantics, so I'm 
>>> guessing that I'm missing a reason why that wouldn't work... Presumably the 
>>> JS source of the regexp literal, as a sequence of UTF-16 code units, 
>>> represents the tetragram code points as surrogate pairs. Can we not 
>>> recognize surrogate pairs in character classes within a /u regexp and 
>>> interpret them as code points?
>> 
>> With /u, that's exactly what happens. My first proposal was to make this 
>> happen even without a new flag, i.e., make
>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
>> work based on code points, and Steve is arguing against that because of 
>> compatibility risk. My proposal also includes some transformations to keep 
>> existing regular expressions working, and Steve correctly observes that if 
>> we have a flag for code point mode, then the transformation is not needed - 
>> old regular expressions would continue to work in code unit mode, while new 
>> regular expressions with /u get code point treatment.
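
To make the code unit versus code point distinction in the exchange above concrete, here is a small sketch. It assumes an engine that implements the `u` flag as later standardized in ECMAScript 2015; the variable names are hypothetical and exist only for illustration.

```javascript
const text = "𝌆𝌇𝌈𝌉𝌊";

// Each tetragram lies outside the BMP, so it occupies two UTF-16 code units.
console.log(text.length);             // 10 code units
console.log(Array.from(text).length); // 5 code points

// Code unit mode: `.` matches a single code unit, i.e. half a surrogate pair.
const dotUnit = /^.$/.test("𝌆");
// Code point mode: `.` matches the whole surrogate pair.
const dotPoint = /^.$/u.test("𝌆");
console.log(dotUnit, dotPoint);       // false true

// With /u, the character class is a range of code points.
const matched = text.match(/[𝌆-𝍖]+/u);
console.log(matched[0] === text);     // true

// Without /u, the class is read as separate code units. The "range" then
// runs from a low surrogate to a high surrogate, which is out of order,
// so constructing the pattern throws instead of matching.
let legacyError = null;
try {
  new RegExp("[𝌆-𝍖]+");
} catch (e) {
  legacyError = e;
}
console.log(legacyError instanceof SyntaxError); // true
```

Old patterns that never mention supplementary characters behave the same with or without the flag, which is the compatibility argument for making code point mode opt-in.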

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
