Re: Full Unicode based on UTF-16 proposal

Roger Andrews Sun, 25 Mar 2012 04:03:13 -0700

Just confirmed C/C++ do allow \Uxxxxxxxx escaped characters for non-BMP codepoints in string literals.


Interesting page at:
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm


So C/C++ has:
   \xNN          8-bit character (U+0000 - U+00FF)
   \uNNNN        16-bit character
   \UNNNNNNNN    32-bit character

This naturally expresses any character, without worrying about the UTF-16 or whatever encoding.


--------------------------------------------------
From: "Roger Andrews"
To: "Norbert Lindenberg"

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.
Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever.
Could you use this to avoid complicated things in RegExps like [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], and instead have things like [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of interest?
The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character.
(Sorry if I've missed something in the prior discussion.)
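
The mapping from one 32-bit code to two 16-bit UTF-16 codes mentioned above can be sketched as follows. This is only an illustration, not part of any proposal in this thread: the helper name toSurrogatePair is hypothetical, and the code assumes an engine that supports const and destructuring.

```javascript
// Hypothetical helper (not from the thread): split a supplementary code
// point (U+10000 - U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000 - U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// U+1D306 is the first tetragram used in the examples in this thread.
const [high, low] = toSurrogatePair(0x1D306);
const tetragram = String.fromCharCode(high, low);

console.log(high.toString(16), low.toString(16)); // d834 df06
console.log(tetragram.length);                    // 2 (code units)
```

So a \U-style escape would let the programmer write the single value 0x1D306 while the engine stores the pair D834 DF06.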

--------------------------------------------------
From: "Norbert Lindenberg"
To: "David Herman"
On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]
As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):
I haven't completely understood this part of the discussion. Looking at /u as a "little red switch" (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:
   js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
   ["𝌆𝌇𝌈𝌉𝌊"]
I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?
With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
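
The difference between the two modes can be sketched as below. This assumes an engine that implements the /u flag as later standardized in ECMAScript 2015, which may differ in detail from the proposals discussed in this thread; the variable names are illustrative only.

```javascript
const text = "𝌆𝌇𝌈𝌉𝌊";

// Code point mode: with /u, each surrogate pair in the class is read as
// one code point, so the range runs from U+1D306 to U+1D356.
const codePointMatch = text.match(/[𝌆-𝍖]+/u);
console.log(codePointMatch[0] === text); // true

// Code unit mode: without /u, the class is read as the code units
// D834, DF06-D834, DF56. The range DF06-D834 is out of order, so the
// pattern is rejected. new RegExp is used here so that the error can be
// caught at run time instead of failing when the script is parsed.
let codeUnitError = null;
try {
  new RegExp("[𝌆-𝍖]+");
} catch (e) {
  codeUnitError = e;
}
console.log(codeUnitError instanceof SyntaxError); // true
```

Under these assumptions the flagless pattern does not silently match the wrong thing; it fails to compile, which is one way to see why code unit mode cannot express this range directly.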

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
