On 25 July 2025 23:17:43 BST, Niels Dossche <dossche.ni...@gmail.com> wrote:
>Hi internals
>
>On PHP 8.5-dev, we ship with pcre2lib 10.45.
>
>This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
>It enables the use of complex character set operations in accordance with UTS#18 
>(Unicode Technical Standard 18).
>This means it becomes possible to nest character sets, perform set operations 
>on them, etc.
>One example of such a set operation is a set subtraction, e.g. the regex 
>"[\p{L}--[QW]]" means "Unicode letters other than Q and W".
>Or a more realistic example (inspired by [1]): the regex "[\p{N}--[0-9]]" 
>matches all non-ASCII Unicode numbers.
>You can also do ORs, ANDs, etc.
>
>This is opt-in in pcre2lib because the interpretation of 
>existing regexes may change.
>This standard is being adopted in other languages too, also opt-in, for 
>example in JavaScript [1].
>To expose this functionality in PHP, we also have to make it opt-in via a 
>modifier.
>
>In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
>This does the same thing as the /u modifier, but extends it with this UTS#18 
>standard.
>PHP already has /u, which enables UTF-8 Unicode mode. So we could do 
>the same as JavaScript and add a /v modifier that extends /u and also enables 
>PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need Unicode processing to 
>enable PCRE2_ALT_EXTENDED_CLASS, but as it comes from a Unicode standard 
>(and at least JavaScript does this too), it may make sense to enable them 
>both.
>
>The actual patch is trivial:
>```diff
>diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
>index 8e0fb2cce5f..4a4727545ad 100644
>--- a/ext/pcre/php_pcre.c
>+++ b/ext/pcre/php_pcre.c
>@@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
>                       case 'S':       /* Pass. */                                     break;
>                       case 'X':       /* Pass. */                                     break;
>                       case 'U':       coptions |= PCRE2_UNGREEDY;             break;
>+#ifdef PCRE2_ALT_EXTENDED_CLASS
>+                      case 'v':       coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
>+#endif
>                       case 'u':       coptions |= PCRE2_UTF;
>       /* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
>          characters, even in UTF-8 mode. However, this can be changed by setting
>
>```
>
>What do we think?
>
>[1] https://github.com/tc39/proposal-regexp-v-flag



Yes, please.

cheers
Derick
