On 2023-04-20 19:04, Carlo Marcelo Arenas Belón wrote:
> All versions of PCRE2 that include PCRE2_MATCH_INVALID_UTF had a bug on
> its JIT implementation that results in failure to match for the negative
> perl classes, and seems to be easier to replicate when the matching
> character is a multibyte one.

Unfortunately that is a little vague. I expect the issue is not limited to \D and \W, as there are other ways to specify negative Perl classes. And if the bug merely seems to be easier to replicate with multibyte characters, it sounds like we may have issues even when matching ASCII characters in a UTF-8 locale.

Furthermore, I'm leery of optimizing for PCRE2 10.42 and earlier. We should focus our optimization efforts on future PCRE2 versions rather than on earlier ones, where such optimizations complicate maintenance for a declining benefit and are likely to provoke bugs that will become harder to debug as time passes.


> Alternatively JIT could be disabled instead, but the option selected has
> less of an impact on performance.

Disabling JIT sounds better, as correctness trumps performance. Until the bug is fixed (or at least better understood, so that we have a workaround we can trust), how about the attached patch instead?
From 4ec71b63f9ac0bb27b60e1c9802edcba868099e8 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Fri, 21 Apr 2023 11:31:12 -0700
Subject: [PATCH] grep: use PCRE2 JIT only in unibyte locales

* src/pcresearch.c (Pcompile): Call pcre2_jit_compile only
if in a unibyte locale, to work around a PCRE2 JIT bug.
---
 NEWS             |  4 ++++
 src/pcresearch.c | 17 +++++++++++------
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index f16c576..b9b8cda 100644
--- a/NEWS
+++ b/NEWS
@@ -11,6 +11,10 @@ GNU grep NEWS                                    -*- outline -*-
   Unicode interpretations.
   [bug introduced in grep 3.10]
 
+  With -P, patterns like \D and \W now work again in a UTF-8 locale,
+  when linked to PCRE2 10.34 or newer.
+  [bug introduced in grep 3.8]
+
   grep no longer fails on files dated after the year 2038,
   when running on 32-bit x86 and ARM hosts using glibc 2.34+.
   [bug introduced in grep 3.9]
diff --git a/src/pcresearch.c b/src/pcresearch.c
index e82bf86..4086bbc 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -243,13 +243,18 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
   pc->mcontext = NULL;
   pc->data = pcre2_match_data_create_from_pattern (pc->cre, gcontext);
 
-  /* Ignore any failure return from pcre2_jit_compile, as that merely
-     means JIT won't be used during matching.  */
-  pcre2_jit_compile (pc->cre, PCRE2_JIT_COMPLETE);
+  /* Do not use PCRE2 JIT in multibyte locales <https://bugs.gnu.org/62983>.
+     FIXME: when the PCRE2 bug is fixed or a reliable workaround found.  */
+  if (!localeinfo.multibyte)
+    {
+      /* Ignore any failure return from pcre2_jit_compile, as that merely
+         means JIT won't be used during matching.  */
+      pcre2_jit_compile (pc->cre, PCRE2_JIT_COMPLETE);
 
-  /* The PCRE documentation says that a 32 KiB stack is the default.  */
-  pc->jit_stack = NULL;
-  pc->jit_stack_size = 32 << 10;
+      /* The PCRE documentation says that a 32 KiB stack is the default.  */
+      pc->jit_stack = NULL;
+      pc->jit_stack_size = 32 << 10;
+    }
 
   pc->empty_match[false] = pcre_exec (pc, "", 0, 0, PCRE2_NOTBOL);
   pc->empty_match[true] = pcre_exec (pc, "", 0, 0, 0);
-- 
2.39.2
