Re: Q: Lonely surrogates and unicode regexps

Mark Davis ☕️ Wed, 28 Jan 2015 02:51:06 -0800

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <ma...@chromium.org> wrote:


> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:
>

Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous
term "code point". For example, U+0378 is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So neither U+D83D (a lone
surrogate) nor U+0378 is a character, though both are code points.
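
To make the distinction concrete, here is a minimal sketch in Java (chosen
only because Java's regex handling comes up in this thread). The class and
method names are hypothetical; the Character methods and constants are
standard JDK calls. Whether U+0378 reports as unassigned depends on the
Unicode version your JDK ships, so treat that result as an assumption.

```java
final class CodePointKinds {
    /** Rough classification of a code point (hypothetical helper). */
    static String classify(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return "not a code point";
        }
        // Check surrogates first: Character.isDefined() reports true for them,
        // because their general category is Cs rather than Cn (unassigned).
        if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
            return "surrogate code point";
        }
        if (!Character.isDefined(cp)) {
            return "unassigned code point";
        }
        return "assigned code point";
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x0378, 0xD83D, 0x1F600, 0x110000};
        for (int cp : samples) {
            System.out.printf("U+%04X: %s%n", cp, classify(cp));
        }
    }
}
```

The point of the three-way split is that "is it an assigned character?" and
"is it a code point?" are separate questions, and a matcher normally cares
about the second one.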

If an ES spec uses the term "character" instead of "code point", then at
some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regular expressions, I'd suggest
looking at Java's approach.
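
As a rough illustration of what that looks like in practice:
java.util.regex matches by code point, so "." consumes a well-formed
surrogate pair as one unit and still matches a lone surrogate as a code
point in its own right. A small sketch, with a hypothetical class name and
made-up sample strings:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LoneSurrogateDemo {
    // DOTALL so that line terminators are counted as well.
    private static final Pattern DOT = Pattern.compile(".", Pattern.DOTALL);

    /** True if a single "." matches the entire input. */
    static boolean dotMatchesWhole(String s) {
        return DOT.matcher(s).matches();
    }

    /** Counts successive "." matches, i.e. the code points the engine sees. */
    static int countDotMatches(String s) {
        Matcher m = DOT.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDE00"; // lead + trail: one code point, U+1F600
        String lone = "\uD83D";       // lead surrogate with no trail

        System.out.println(dotMatchesWhole(pair));
        System.out.println(dotMatchesWhole(lone));
        // 'a', U+1F600, lone U+D83D, 'b'
        System.out.println(countDotMatches("a" + pair + lone + "b"));
    }
}
```

Here the match count agrees with String.codePointCount on the same input,
which is the behaviour I would look at when deciding what "." should mean
for lone surrogates.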

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
