Just confirmed C/C++ do allow \Uxxxxxxxx escaped characters for non-BMP code
points in string literals.
Interesting page at:
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm
So C/C++ has:
\xNN        8-bit character (U+0000 - U+00FF)
\uNNNN      16-bit character (U+0000 - U+FFFF)
\UNNNNNNNN  32-bit character (any code point, up to U+10FFFF)
This naturally expresses any character, without worrying about UTF-16 or
whatever the encoding is.
--------------------------------------------------
From: "Roger Andrews"
To: "Norbert Lindenberg"
Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
literals? The \U format expresses a full 32-bit code, which could be
mapped internally to two 16-bit UTF-16 codes.
Then the programmer can describe exactly the required characters without
caring about their coding in UTF-16 or whatever.
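
As a rough illustration of the mapping mentioned above, here is a minimal JavaScript sketch of how a code point above U+FFFF splits into two UTF-16 code units. The helper name toSurrogatePair is made up for this example and does not come from the thread; the arithmetic is the standard UTF-16 surrogate calculation.

```javascript
// Hypothetical helper (not from the thread): split a supplementary code
// point into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// U+1D306 (TETRAGRAM FOR CENTRE) becomes the pair D834 DF06.
const [high, low] = toSurrogatePair(0x1D306);
console.log(high.toString(16), low.toString(16)); // d834 df06
```

A \U-style escape would let the programmer write only the code point and leave this split to the compiler or engine.
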
Could you use this to avoid complicated things in RegExps like
[{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like
[\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of
interest?
The same goes for String literals, where the programmer does not really
care about the encoding, just specifying the character.
(Sorry if I've missed something in the prior discussion.)
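
For comparison, JavaScript later gained something close to this idea: ECMAScript 2015 standardized a \u{...} code point escape for string literals and, together with the u flag, for regular expressions. The sketch below assumes an engine with ES2015 support; the tetragram range is borrowed from the examples elsewhere in this thread, and the variable names are only illustrative.

```javascript
// Assumes an ES2015+ engine. \u{1D306} names the code point directly,
// with no need to spell out a surrogate pair.
const tetragram = "\u{1D306}";

// The string is still stored as two UTF-16 code units...
console.log(tetragram.length);                 // 2
// ...and is identical to the explicit surrogate pair.
console.log(tetragram === "\uD834\uDF06");     // true

// In a /u regexp, a character class can range over code points.
const tetragrams = /^[\u{1D306}-\u{1D356}]+$/u;
console.log(tetragrams.test("\u{1D306}\u{1D307}")); // true
console.log(tetragrams.test("abc"));                // false
```
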
--------------------------------------------------
From: "Norbert Lindenberg"
To: "David Herman"
On Mar 24, 2012, at 12:21, David Herman wrote:
[snip]
As for whether the switch to code-point-based matching should be
universal or require /u (an issue that your proposal leaves open), IMHO
it's better to require /u since it avoids the need for transforming
\uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and
[\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and
additionally avoids at least three potentially breaking changes (two of
which are explicitly mentioned in your proposal):
I haven't completely understood this part of the discussion. Looking at
/u as a "little red switch" (LRS), i.e., an opportunity to make
judicious breaks with compatibility, could we not allow character
classes with unescaped non-BMP code points, e.g.:
js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]
I'm still getting up to speed on Unicode and JS string semantics, so I'm
guessing that I'm missing a reason why that wouldn't work... Presumably
the JS source of the regexp literal, as a sequence of UTF-16 code units,
represents the tetragram code points as surrogate pairs. Can we not
recognize surrogate pairs in character classes within a /u regexp and
interpret them as code points?
With /u, that's exactly what happens. My first proposal was to make this
happen even without a new flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to
keep existing regular expressions working, and Steve correctly observes
that if we have a flag for code point mode, then the transformation is
not needed - old regular expressions would continue to work in code unit
mode, while new regular expressions with /u get code point treatment.
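
To make the difference between the two modes concrete, here is a small sketch. It assumes an engine that implements the u flag the way ECMAScript 2015 eventually specified it, which may differ in detail from the proposal discussed here. The variable names are only for illustration.

```javascript
// Assumes an ES2015+ engine; names are illustrative only.
const text = "𝌆𝌇𝌈𝌉𝌊"; // five tetragrams, ten UTF-16 code units

// Code unit mode (no flag): "." matches a single 16-bit code unit, so
// one supplementary character does not match /^.$/.
console.log(/^.$/.test("𝌆"));   // false
// Code point mode (/u): "." matches the whole surrogate pair.
console.log(/^.$/u.test("𝌆"));  // true

// Old-style pattern that spells the range out in code units: one high
// surrogate followed by a class of low surrogates. It keeps working
// unchanged without the flag.
const legacy = /^(?:\uD834[\uDF06-\uDF56])+$/;
console.log(legacy.test(text));  // true

// New-style pattern: with /u the class ranges over code points.
const matched = text.match(/[𝌆-𝍖]+/u);
console.log(matched[0] === text); // true
```

Because the old pattern never carries the flag, it is never reinterpreted, which is why no transformation of existing regular expressions is needed.
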
_______________________________________________
es-discuss mailing list
https://mail.mozilla.org/listinfo/es-discuss