Re: Full Unicode based on UTF-16 proposal

Roger Andrews Mon, 26 Mar 2012 04:45:10 -0700

The strawman is for source code characters, and says it has "no implications
for string value encodings" (or RegExps).

String & regexp literal escape sequences are explicitly defined in ES5
sections 7.8.4 & 7.8.5. Will Strawman style also work in ES6 string & regexp
literals? Thus making regexp ranges much nicer (see final example below).



As well as describing code points that have not yet been defined as
characters, character escapes in string literals and regexps are good:
1)  control characters don't have glyphs at all,
2)  the various space glyphs are not readily distinguishable (same for some
dash/minus/line glyphs),
3)  breaking/non-breaking versions of characters are not distinguishable,
4)  many other glyphs are hard to distinguish (being tiny adjustments in
positioning or form detail),
5)  some characters are "combining" -- which makes for a messy and confusing
program if you use them raw.

If you use the raw non-ASCII characters in a program then you need some
means of creating them, preferably via a normal keyboard and in your
favourite text editor. All program readers need appropriate fonts installed
to fully understand the program, and program maintainers also need a
Unicode-capable text editor (potentially including non-BMP support).
All links/stores that the program travels over or rests in must be
Unicode-capable.

Whereas using only ASCII chars to write a program is easy to do and always
works no matter how basic your computing/transmission infrastructure.
(ASCII chars never get silently mangled in transmission or text editors.)


How to represent character escapes in a language.
C/C++ has:
   \xNN          8-bit char (U+0000 - U+00FF)
   \uNNNN        16-bit char (U+0000 - U+FFFF)
   \UNNNNNNNN    32-bit char (i.e. any 21-bit Unicode char)
Strawman for source chars has:
   \u{N...}      8 to 24-bit char (i.e. any 21-bit Unicode char)


I'm struggling with how non-BMP escapes would be used in practice in strings
& regexps -- especially regexp ranges. Will Strawman style be used in
string & regexp literals?


Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").
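
For reference, that surrogate pair follows from the UTF-16 encoding rule.
A minimal sketch in JavaScript; the helper names toSurrogatePair and
toEscapeText are made up for illustration and are not part of any proposal:

```javascript
// Hypothetical helper (not from the thread): split a non-BMP code point
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format the pair as the \uXXXX\uXXXX escape text
// a programmer would have to type by hand.
function toEscapeText(codePoint) {
  return toSurrogatePair(codePoint)
    .map((unit) => "\\u" + unit.toString(16).toUpperCase())
    .join("");
}

console.log(toEscapeText(0x1D307)); // \uD834\uDF07
console.log(toEscapeText(0x1D356)); // \uD834\uDF56
```

This is the arithmetic a programmer has to do (or look up) whenever only
\uNNNN escapes are available.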

To create the string "I like 𝌇" using escapes
in C/C++ you can create a string:
          "I like \U0001D307"
if the Strawman style works in strings, in ES6 presumably you say:
          "I like \u{1D307}"
or do you have to know UTF-16 encoding rules and say:
          "I like \uD834\uDF07"

To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/
should the programmer write:
C/C++ style
           /[\U0001D307-\U0001D356]/
or will Strawman style work in regexps
           /[\u{1D307}-\u{1D356}]/
or in UTF-16 with {} grouping
           /[{\uD834\uDF07}-{\uD834\uDF56}]/

Either C/C++ style or Strawman style escape is readable, natural, doesn't
require knowledge of UTF-16 encoding rules, can be created easily with any
old keyboard, and won't upset text editors.


It's a bit unfriendly to require programmers to know UTF-16 rules just to
put a non-BMP character in a string or regexp using an escape.  And in a
regexp range it looks ugly and confusing.
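
To make the comparison concrete, here is a sketch of the two spellings of
that range. It assumes an engine that implements the \u{...} escape and the
u flag as they were later standardized in ES2015, which is an assumption
relative to this thread; the variable names are illustrative only:

```javascript
// Code-unit mode (no flag): the range has to be spelled with surrogates,
// here as a fixed high surrogate followed by a low-surrogate range.
const byCodeUnits = /^\uD834[\uDF07-\uDF56]$/;

// Code-point mode (u flag): the range is written directly in code points.
const byCodePoints = /^[\u{1D307}-\u{1D356}]$/u;

const inside = "\uD834\uDF07";   // U+1D307, first character of the range
const outside = "\uD834\uDF06";  // U+1D306, just below the range

console.log(byCodeUnits.test(inside), byCodePoints.test(inside));
console.log(byCodeUnits.test(outside), byCodePoints.test(outside));
```

The code-unit form only stays this short because both ends of the range
share the same high surrogate; a range that crosses high surrogates needs
several alternatives.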


--------------------------------------------------
From: "Norbert Lindenberg"


There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just
use the characters directly, as Dave did in
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as
regular expressions where you might have to refer to range limits that
aren't actually assigned characters, or in test cases where you might use
characters for which your OS doesn't have glyphs yet.

Norbert


On Mar 25, 2012, at 2:57 , Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
literals?  The \U format expresses a full 32-bit code, which could be
mapped internally to two 16-bit UTF-16 codes.

Then the programmer can describe exactly the required characters without
caring about their coding in UTF-16 or whatever.

Could you use this to avoid complicated things in RegExps like
[{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like
[\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of
interest?

The same goes for String literals, where the programmer does not really
care about the encoding, just specifying the character.

(Sorry if I've missed something in the prior discussion.)

--------------------------------------------------
From: "Norbert Lindenberg"
To: "David Herman"


On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]

As for whether the switch to code-point-based matching should be
universal or require /u (an issue that your proposal leaves open),
IMHO it's better to require /u since it avoids the need for
transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}]
and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}],
and additionally avoids at least three potentially breaking changes
(two of which are explicitly mentioned in your proposal):


I haven't completely understood this part of the discussion. Looking at
/u as a "little red switch" (LRS), i.e., an opportunity to make
judicious breaks with compatibility, could we not allow character
classes with unescaped non-BMP code points, e.g.:

  js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
  ["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so
I'm guessing that I'm missing a reason why that wouldn't work...
Presumably the JS source of the regexp literal, as a sequence of UTF-16
code units, represents the tetragram code points as surrogate pairs.
Can we not recognize surrogate pairs in character classes within a /u
regexp and interpret them as code points?


With /u, that's exactly what happens. My first proposal was to make this
happen even without a new flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to
keep existing regular expressions working, and Steve correctly observes
that if we have a flag for code point mode, then the transformation is
not needed - old regular expressions would continue to work in code unit
mode, while new regular expressions with /u get code point treatment.
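
The difference between the two modes can be sketched as follows, assuming
an engine where the u flag behaves as it was later standardized in ES2015
(the thread itself only discusses it as a proposal). The string is written
with escapes so the example survives ASCII-only tooling:

```javascript
// The five tetragrams U+1D306..U+1D30A, each stored as a surrogate pair.
const tetragrams = "\u{1D306}\u{1D307}\u{1D308}\u{1D309}\u{1D30A}";

// Code-point mode: the class range is interpreted in code points,
// so the whole string matches.
const match = tetragrams.match(/[\u{1D306}-\u{1D356}]+/u);
console.log(match[0] === tetragrams);

// Code-unit mode: "." consumes a single 16-bit code unit, so one
// non-BMP character (two code units) is not matched by /^.$/.
const oneChar = "\u{1D306}";
console.log(/^.$/.test(oneChar));   // code-unit mode
console.log(/^.$/u.test(oneChar));  // code-point mode
```

Existing regexps without the flag keep their code-unit behaviour, which is
the compatibility argument made above.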

