Hi there,

The attached patch is a port of the most recent commit in the irregex
repository, which fixes this upstream ticket:
https://github.com/ashinn/irregex/issues/27

Cheers,
Peter
From b552052f4085e84d662f70bb76cb4abf41ab25bc Mon Sep 17 00:00:00 2001
From: Peter Bex <[email protected]>
Date: Mon, 5 Jul 2021 11:38:43 +0200
Subject: [PATCH] Update irregex to upstream 960fa22b, fixing a group matching
 issue

When a kleene star is used around an alternative containing
submatches, in some circumstances the DFA compilation would emit
reordering commands which would cause the regex capturing to go wrong,
returning faulty matches.

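This would go wrong because the reordering commands are executed one
by one, each reading from a source memory slot and writing to a target
memory slot, so an earlier command can overwrite a slot that a later
command still needs to read.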

For example, the following set of reordering commands has no "correct"
order in which they can be executed:

p[0] <- p[1]
p[1] <- p[0]

After executing both of them in either order, both slots will contain
the same value instead of being swapped as intended.  This is fixed by
first fetching the old value of every source slot into a closure, and
only then executing the writes.
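
The difference between the two approaches can be illustrated with a
small standalone sketch.  This is not the irregex code: it assumes a
flat vector as memory and commands written as (dst . src) pairs, and
the names naive-reorder! and two-pass-reorder! are made up for this
example.

```scheme
;; Illustrative sketch only: the procedures and the command format
;; below are hypothetical, not irregex internals.
;; Each command is a pair (dst . src), meaning memory[dst] <- memory[src].

(define (naive-reorder! memory cmds)
  ;; Executes each command immediately, so a later command may read a
  ;; slot that an earlier command has already overwritten.
  (for-each (lambda (c)
              (vector-set! memory (car c) (vector-ref memory (cdr c))))
            cmds))

(define (two-pass-reorder! memory cmds)
  ;; Pass 1 (map): read every source value and capture it in a closure.
  ;; Pass 2 (for-each): run the closures, which only write.
  (for-each (lambda (execute!) (execute!))
            (map (lambda (c)
                   (let ((dst (car c))
                         (value (vector-ref memory (cdr c))))
                     (lambda () (vector-set! memory dst value))))
                 cmds)))

;; The swap from the example above: p[0] <- p[1], p[1] <- p[0].
(define swap-cmds '((0 . 1) (1 . 0)))

(define naive-result (vector 'a 'b))
(naive-reorder! naive-result swap-cmds)       ; both slots end up holding b

(define two-pass-result (vector 'a 'b))
(two-pass-reorder! two-pass-result swap-cmds) ; slots are swapped: b, a
```

Because the first pass only reads and the second pass only writes, the
order in which the commands are listed no longer matters.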

Fixes upstream issue #27
---
 NEWS               |  4 +++-
 irregex-core.scm   | 18 ++++++++++++------
 tests/re-tests.txt |  1 +
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/NEWS b/NEWS
index 46af9bd1..53a40f0f 100644
--- a/NEWS
+++ b/NEWS
@@ -10,9 +10,11 @@
     of irregex-replace/all with positive lookbehind so all matches are
     replaced instead of only the first (reported by Kay Rhodes), and
     a regression regarding replacing empty matches which was introduced
-    by the fixes in 0.9.7 (reported by Sandra Snan).  Finally, the
+    by the fixes in 0.9.7 (reported by Sandra Snan).  Also, the
     http-url shorthand now allows any top-level domain and the old
     "top-level-domain" now also supports "edu" (fixed by Sandra Snan).
+    Finally, a problem was fixed with capturing groups inside a kleene
+    star, which could sometimes return incorrect parts of the match.
   - current-milliseconds has been deprecated in favor of the name
     current-process-milliseconds, to avoid confusion due to naming
     of current-milliseconds versus current-seconds, which do something
diff --git a/irregex-core.scm b/irregex-core.scm
index 8f672333..a8e7c97f 100644
--- a/irregex-core.scm
+++ b/irregex-core.scm
@@ -2235,12 +2235,18 @@
                                         (chunk&position (cons src (+ i 1))))
                                     (vector-set! slot (car s) chunk&position)))
                                 (cdr cmds))
-                      (for-each (lambda (c)
-                                  (let* ((tag (vector-ref c 0))
-                                         (ss (vector-ref memory (vector-ref c 1)))
-                                         (ds (vector-ref memory (vector-ref c 2))))
-                                    (vector-set! ds tag (vector-ref ss tag))))
-                                (car cmds)))))
+                      ;; Reassigning commands may be in an order which
+                      ;; causes memory cells to be clobbered before
+                      ;; they're read out.  Make 2 passes to maintain
+                      ;; old values by copying them into a closure.
+                      (for-each (lambda (execute!) (execute!))
+                                (map (lambda (c)
+                                       (let* ((tag (vector-ref c 0))
+                                              (ss (vector-ref memory (vector-ref c 1)))
+                                              (ds (vector-ref memory (vector-ref c 2)))
+                                              (value-from (vector-ref ss tag)))
+                                         (lambda () (vector-set! ds tag value-from))))
+                                     (car cmds))))))
                   (if new-finalizer
                       (lp2 (+ i 1) next src (+ i 1) new-finalizer)
                       (lp2 (+ i 1) next res-src res-index #f))))
diff --git a/tests/re-tests.txt b/tests/re-tests.txt
index 7a56edb7..39a747e6 100644
--- a/tests/re-tests.txt
+++ b/tests/re-tests.txt
@@ -171,3 +171,4 @@ multiple words	multiple words, yeah	y	&	multiple words
 (a([^a])*)*	abcaBC	y	&-\1-\2	abcaBC-aBC-C
 ([Aa]b).*\1	abxyzab	y	&-\1	abxyzab-ab
 a([\/\\]*)b	a//\\b	y	&-\1	a//\\b-//\\
+(?:[[:alnum:]]|(@[[:alnum:]]))*	oeh@2tu@2n342	y	\1	@2
-- 
2.20.1
