Re: Q: Lonely surrogates and unicode regexps

Norbert Lindenberg Sat, 31 Jan 2015 00:40:33 -0800

> On Jan 28, 2015, at 8:30, Allen Wirfs-Brock <al...@wirfs-brock.com> wrote:
>
> On Jan 28, 2015, at 4:54 AM, Wes Garland <w...@page.ca> wrote:
>
>> Do we extend the regexp syntax to have a symbol which matches an unmatched 
>> surrogate?
> we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u
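
A minimal sketch of how that pattern behaves, assuming an engine that implements the ES6 u flag. The helper name hasLoneSurrogate and the sample strings are made up for illustration; only the pattern itself comes from the text above.

```javascript
// Hypothetical helper; the character class is the one quoted above.
const loneSurrogate = /[\u{D800}-\u{DFFF}]/u;

function hasLoneSurrogate(str) {
  return loneSurrogate.test(str);
}

const paired = "\uD83D\uDE00";  // U+1F600 as a well-formed surrogate pair
const loneHigh = "x\uD83Dy";    // high surrogate with no low surrogate after it
const loneLow = "x\uDE00y";     // low surrogate with no high surrogate before it

// With /u, a well-formed pair is decoded into one code point (U+1F600),
// which lies outside the range D800-DFFF.
console.log(hasLoneSurrogate(paired));   // false
console.log(hasLoneSurrogate(loneHigh)); // true
console.log(hasLoneSurrogate(loneLow));  // true

// Without /u, matching works on code units, so each half of the pair matches.
console.log(/[\uD800-\uDFFF]/.test(paired)); // true

// A single surrogate written as \u{D83D} matches only when it is unpaired.
console.log(/\u{D83D}/u.test(paired));   // false
console.log(/\u{D83D}/u.test(loneHigh)); // true
```

The pattern has no g flag, so repeated calls to test() do not carry lastIndex state between strings.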

>>   How about reserved code points?  What happens when they become assigned?
> Other than the initial decoding of valid surrogate pairs into 32-bit code 
> points, the ES6 //u RegExp spec. applies no semantics to any code points in 
> the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding to compare case-insensitively. If the 
comparison involves code points that are unassigned in the Unicode version 
assumed by an ECMAScript implementation and in a later version get assigned to 
characters that are case-variants of each other, then the RegExp behavior can 
change. See section 21.2.2.8.2.

– RegExp knows a few character classes: \d, \D, \s, \S, \w, and \W. Of these, 
\d, \D, \w, and \W are defined by fixed character lists that cannot change, 
but \s (and therefore \S) could change if Unicode assigns new characters to 
the category “Separator, space”. See section 21.2.2.12.
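
Both points can be observed directly. The following is a rough sketch, assuming an engine with ES6 //u support; the variable names are invented for illustration, and the last result is deliberately left without an expected value because it depends on the Unicode version the engine implements.

```javascript
// Case folding: U+212A KELVIN SIGN folds to "k", U+017F LATIN SMALL LETTER
// LONG S folds to "s". Only //ui uses Unicode case folding.
const kelvinWithoutU = /\u212A/i.test("k");
const kelvinWithU = /\u212A/iu.test("k");
const longSWithU = /\u017F/iu.test("s");

console.log(kelvinWithoutU); // false
console.log(kelvinWithU);    // true
console.log(longSWithU);     // true

// \s covers the "Separator, space" category, e.g. U+2003 EM SPACE.
const emSpaceIsSpace = /^\s$/u.test("\u2003");
console.log(emSpaceIsSpace); // true

// U+180E MONGOLIAN VOWEL SEPARATOR was moved out of "Separator, space" in
// Unicode 6.3, so the result here varies with the engine's Unicode version.
const mvsIsSpace = /^\s$/u.test("\u180E");
console.log(mvsIsSpace);
```

The U+180E case is an example of a category change rather than a new assignment, but it shows the same effect: the set matched by \s follows the Unicode data the implementation ships with.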

But in general //u is defined based on code points and doesn’t care whether 
code points are assigned or reserved.
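
As a small illustration, here is a sketch using U+50000, a code point in plane 5, which has no assigned characters as of this writing. The variable names are assumptions for the example only.

```javascript
// U+50000 is unassigned, but //u still treats it as an ordinary code point.
const unassigned = String.fromCodePoint(0x50000);

const inRange = /^[\u{50000}-\u{5FFFF}]$/u.test(unassigned);
const oneCodePoint = /^.$/u.test(unassigned);
const oneCodeUnit = /^.$/.test(unassigned);

console.log(unassigned.length); // 2 (stored as a surrogate pair)
console.log(inRange);           // true
console.log(oneCodePoint);      // true:  with /u, "." consumes the whole code point
console.log(oneCodeUnit);       // false: without /u, "." consumes one code unit
```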

Norbert

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss