Re: Q: Lonely surrogates and unicode regexps

André Bargull Wed, 28 Jan 2015 03:58:24 -0800

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:


> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:
Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous
term "code point". For example, U+0378 is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So neither U+D83D nor U+0378
is a character.

If an ES spec uses the term "character" instead of "code point", then at
some point in the text it needs to disambiguate what is meant.


"character" is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.
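
As a rough illustration of the two meanings, the sketch below counts how many "characters" the . atom consumes with and without the u flag. The helper name countDotMatches and the sample strings are made up for this example; the lone-surrogate result assumes an engine that implements the final ES2015 semantics, where an unpaired surrogate is treated as a code point of its own.

```javascript
// U+1F600 encoded as a UTF-16 surrogate pair, and a lone lead surrogate.
const pair = "\uD83D\uDE00";
const loneLead = "\uD83D";

// Hypothetical helper: count how many times the `.` atom matches in `str`,
// either as a BMP pattern (no u flag) or as a Unicode pattern (u flag).
function countDotMatches(str, unicode) {
  const re = new RegExp(".", unicode ? "gu" : "g");
  return (str.match(re) || []).length;
}

// BMP pattern: each 16-bit code unit is one "character".
console.log(countDotMatches(pair, false)); // 2

// Unicode pattern: the surrogate pair is one "character" (one code point).
console.log(countDotMatches(pair, true)); // 1

// Unicode pattern, lone lead surrogate: assumed to match as a single
// code point under the final ES2015 semantics.
console.log(countDotMatches(loneLead, true)); // 1
```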



[1] https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
