+1000 for me.

Cheers.

On Fri, 25 Jul 2025 at 23:20, Niels Dossche <dossche.ni...@gmail.com> wrote:

> Hi internals
>
> On PHP 8.5-dev, we ship with pcre2lib 10.45.
>
> This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
> It enables the use of complex character set operations in accordance with
> UTS#18 (Unicode Technical Standard 18).
> This means it becomes possible to nest character sets, perform set
> operations on them, etc.
> One example of such a set operation is set subtraction, e.g. the regex
> "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
> Or a more realistic example (inspired by [1]): the regex
> "[\p{N}--[0-9]]" matches all non-ASCII Unicode numbers.
> You can also do ORs, ANDs, etc.
>
> This is opt-in in pcre2lib because the interpretation of existing regexes
> may change.
> This standard is being adopted in other languages too, also opt-in, for
> example in JavaScript [1].
> To expose this functionality in PHP, we also have to make it opt-in via a
> modifier.
>
> In JavaScript, this is enabled via the /v modifier at the end of the regex
> [1].
> This does the same thing as the /u modifier, but extends it with this
> UTS#18 standard.
> We already have /u in PHP, which enables UTF-8 Unicode mode. So we could
> do the same as JavaScript and add a /v modifier that extends /u and also
> enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need Unicode
> processing to enable PCRE2_ALT_EXTENDED_CLASS, but as it comes from a
> Unicode standard (and JavaScript, at least, does this too), it may make
> sense to enable both.
>
> The actual patch is trivial:
> ```diff
> diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
> index 8e0fb2cce5f..4a4727545ad 100644
> --- a/ext/pcre/php_pcre.c
> +++ b/ext/pcre/php_pcre.c
> @@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
>                         case 'S':       /* Pass. */                             break;
>                         case 'X':       /* Pass. */                             break;
>                         case 'U':       coptions |= PCRE2_UNGREEDY;             break;
> +#ifdef PCRE2_ALT_EXTENDED_CLASS
> +                       case 'v':       coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
> +#endif
>                         case 'u':       coptions |= PCRE2_UTF;
>         /* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
>            characters, even in UTF-8 mode. However, this can be changed by setting
>
> ```
>
> What do we think?
>
> [1] https://github.com/tc39/proposal-regexp-v-flag
>
> Kind regards
> Niels
>
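
For anyone skimming the diff: the key point is that 'v' falls through into 'u', so /v implies everything /u enables. Below is a minimal standalone sketch of that control flow. It is not the real php_pcre.c code: the OPT_* constants and the parse_modifiers() and self_check() names are hypothetical stand-ins, the bit values do not match pcre2.h, and the real 'u' case does more than set a single flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCRE2 option bits. The values are
   illustrative only and do not match pcre2.h. */
#define OPT_UNGREEDY           (1u << 0)
#define OPT_UTF                (1u << 1)
#define OPT_ALT_EXTENDED_CLASS (1u << 2)

/* Translate a modifier string such as "Uv" into option bits.
   Returns 0 on success, -1 on an unknown modifier. */
int parse_modifiers(const char *mods, uint32_t *coptions)
{
    *coptions = 0;
    for (; *mods != '\0'; mods++) {
        switch (*mods) {
        case 'U':
            *coptions |= OPT_UNGREEDY;
            break;
        case 'v':
            /* 'v' sets its own flag, then continues into 'u'. */
            *coptions |= OPT_ALT_EXTENDED_CLASS;
            /* fall through */
        case 'u':
            *coptions |= OPT_UTF;
            break;
        default:
            return -1;
        }
    }
    return 0;
}

/* Small usage check: "u" alone sets only the UTF bit, while "v" sets
   both the extended-class bit and the UTF bit. */
void self_check(void)
{
    uint32_t opts;

    assert(parse_modifiers("u", &opts) == 0);
    assert(opts == OPT_UTF);

    assert(parse_modifiers("v", &opts) == 0);
    assert(opts == (OPT_ALT_EXTENDED_CLASS | OPT_UTF));
}
```

Because of the fall-through, a build against an older pcre2lib (where the #ifdef drops the 'v' case) would simply reject 'v' as an unknown modifier, while 'u' keeps working unchanged.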
