+1000 for me. Cheers.
On Fri, 25 Jul 2025 at 23:20, Niels Dossche <dossche.ni...@gmail.com> wrote:

> Hi internals
>
> On PHP 8.5-dev, we ship with pcre2lib 10.45.
>
> This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
> It enables the use of complex character set operations in accordance with
> UTS#18 (Unicode Technical Standard 18).
> This means it becomes possible to nest character sets, perform set
> operations on them, etc.
> One example of such a set operation is a set subtraction, e.g. the regex
> "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
> Or a more realistic example (inspired by [1]): the regex
> "[\p{N}--[0-9]]" matches all non-ASCII Unicode numbers.
> You can also do ORs, ANDs, etc.
>
> The reason this is opt-in in pcre2lib is that the interpretation of
> existing regexes may change.
> This standard is being adopted in other languages too, also opt-in, for
> example in JavaScript [1].
> To expose this functionality in PHP, we also have to make it opt-in via a
> modifier.
>
> In JavaScript, this is enabled via the /v modifier at the end of the regex
> [1].
> This does the same thing as the /u modifier, but extends it with this
> UTS#18 standard.
> We also already have /u in PHP that enables UTF-8 unicode mode. So we
> could do the same as JavaScript and add a /v modifier that extends /u and
> also enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need unicode
> processing for enabling PCRE2_ALT_EXTENDED_CLASS, but as it comes from a
> unicode standard (and at least JavaScript does this too), it may make
> sense to enable them both.
>
> The actual patch is trivial:
> ```diff
> diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
> index 8e0fb2cce5f..4a4727545ad 100644
> --- a/ext/pcre/php_pcre.c
> +++ b/ext/pcre/php_pcre.c
> @@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
>              case 'S':  /* Pass. */  break;
>              case 'X':  /* Pass. */  break;
>              case 'U':  coptions |= PCRE2_UNGREEDY;  break;
> +#ifdef PCRE2_ALT_EXTENDED_CLASS
> +            case 'v':  coptions |= PCRE2_ALT_EXTENDED_CLASS;  ZEND_FALLTHROUGH;
> +#endif
>              case 'u':  coptions |= PCRE2_UTF;
>      /* In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
>         characters, even in UTF-8 mode. However, this can be changed by setting
> ```
>
> What do we think?
>
> [1] https://github.com/tc39/proposal-regexp-v-flag
>
> Kind regards
> Niels