Hi all, I had a look at #878, and though this is mind-bending code (at least, for my feeble mind) it's pretty clear what's gone wrong if you compare the code to the reference implementation.
Of course, the question then becomes which is right, ours or Olin's. To verify that more easily because these vectors are a little hard to grok directly, I made a test case which is easier to step through in your head. The idea behind kmp-search is explained in the SRFI document, but it was pretty unclear to me all this time, until I was "forced" to understand for this bug, so let me try and explain it a bit here. The idea is that you look at a search string character-by-character without any backtracking. The pattern string you're matching against then remembers some state so that it knows where in the pattern you are. For example, if we match "abcabe" against the pattern "abe", we get this: start: _abe read a => a_be read b => ab_e read c => _abe (reset) read a => a_be read b => ab_e read e => abe_ (done) We know we're done because we arrived at the pattern's final index. If there's still data waiting on the stream, we need to decide how the search is anchored. This is essentially the same as the DFA regex matching performed in irregex; it's a simple state machine. Where it gets trickier is when the pattern contains repetitions. For example, the pattern "abac" matches against search string "ababac" like this: start: _abac read a => a_bac read b => ab_ac read a => aba_c read b => ab_ac (reset, but to position 2!) read a => aba_c read c => abac_ (done) As you can see, the "magic" here is in how it determines to jump back to position 2, it somehow "knows" that the string "ab" is a suffix of the complete string read so far, so it won't have to re-check that part of the pattern. The restart-vector is simply a representation of the state transition diagram for this state machine. The code which constructs this restart vector is mind-bending, and I still don't fully understand it. It somehow seems to loop in parallel, simultaneously stepping through three indices. The only really simple bit is the "k" variable, which is the "main" loop, through the pattern like k = start..end. At the same time it loops through the vector using i = 0..len and using j, which uses the positions in the restart vector constructed so far (between 0..i, or at least that's what I *think*). How and what exactly doesn't matter so much right now, because the tests I added clearly show that the current CHICKEN code is simply wrong. The reference implementation passes the tests, and really differs only in two places, where the ck variable is used (which does not exist in the reference implementation). ck represents the character at position k in the pattern, but the problem is that in the reference implementation it indexes the string in one place at k+1 and in the other place at k. The CHICKEN version indexes the string in both places at k. The patch simply reverts this attempted optimisation to behave like the reference implementation again. Cheers, Peter -- http://www.more-magic.net
>From e154cabb2171e16319866c5fce6ee3b50b73da0a Mon Sep 17 00:00:00 2001 From: Peter Bex <[email protected]> Date: Sun, 22 Jun 2014 18:02:33 +0200 Subject: [PATCH] Fix #878 which was indeed a bug, caused by an incorrect hand-rolled optimisation. This adds some more "integration" test cases so that we can verify more easily that the combination of make-kmp-restart-vector and kmp-step is looping through the pattern correctly as it advances through the search string. The optimisation was due to a mistaken reading of the reference implementation: the pattern was indexed as pat[k+1] at the j=-1 case and as pat[k] at the pat[k]=pat[j+start] case, but the optimisation changed the code to use pat[k] in both cases. --- NEWS | 1 + srfi-13.scm | 43 +++++++++++++++-------------- tests/srfi-13-tests.scm | 70 +++++++++++++++++++++++++++++++++++++++-------- 3 files changed, 82 insertions(+), 32 deletions(-) diff --git a/NEWS b/NEWS index a9ded9e..693ee40 100644 --- a/NEWS +++ b/NEWS @@ -6,6 +6,7 @@ - Unit tcp now implicitly depends on ports instead of extras. This may break programs which don't use modules and forgot to require extras but use procedures from it. + - Fixed bug in make-kmp-restart-vector from SRFI-13. - Unit lolevel: - Restore long-lost but still documented "vector-like?" procedure (#983) diff --git a/srfi-13.scm b/srfi-13.scm index 65b748f..dec54b2 100644 --- a/srfi-13.scm +++ b/srfi-13.scm @@ -1404,35 +1404,36 @@ ((c= char=?) rest) ; (procedure? c=)) (receive (rest2 start end) (string-parse-start+end make-kmp-restart-vector pattern rest) (let* ((rvlen (- end start)) - (rv (make-vector rvlen -1))) + (rv (make-vector rvlen -1))) (if (> rvlen 0) (let ((rvlen-1 (- rvlen 1)) (c0 (string-ref pattern start))) ;; Here's the main loop. We have set rv[0] ... rv[i]. ;; K = I + START -- it is the corresponding index into PATTERN. - (let lp1 ((i 0) (j -1) (k start)) + (let lp1 ((i 0) (j -1) (k start)) (if (< i rvlen-1) - (let ((ck (string-ref pattern k))) - ;; lp2 invariant: - ;; pat[(k-j) .. k-1] matches pat[start .. start+j-1] - ;; or j = -1. - (let lp2 ((j j)) - - (cond ((= j -1) - (let ((i1 (+ i 1))) - (vector-set! rv i1 (if (c= ck c0) -1 0)) - (lp1 i1 0 (+ k 1)))) - - ;; pat[(k-j) .. k] matches pat[start..start+j]. - ((c= ck (string-ref pattern (+ j start))) - (let* ((i1 (+ 1 i)) - (j1 (+ 1 j))) - (vector-set! rv i1 j1) - (lp1 i1 j1 (+ k 1)))) - - (else (lp2 (vector-ref rv j)))))))))) + ;; lp2 invariant: + ;; pat[(k-j) .. k-1] matches pat[start .. start+j-1] + ;; or j = -1. + (let lp2 ((j j)) + + (cond ((= j -1) + (let ((i1 (+ i 1)) + (ck+1 (string-ref pattern (add1 k)))) + (vector-set! rv i1 (if (c= ck+1 c0) -1 0)) + (lp1 i1 0 (+ k 1)))) + + ;; pat[(k-j) .. k] matches pat[start..start+j]. + ((c= (string-ref pattern k) + (string-ref pattern (+ j start))) + (let* ((i1 (+ 1 i)) + (j1 (+ 1 j))) + (vector-set! rv i1 j1) + (lp1 i1 j1 (+ k 1)))) + + (else (lp2 (vector-ref rv j))))))))) rv)))) diff --git a/tests/srfi-13-tests.scm b/tests/srfi-13-tests.scm index 1262b82..5016b9c 100644 --- a/tests/srfi-13-tests.scm +++ b/tests/srfi-13-tests.scm @@ -607,17 +607,65 @@ (test "make-kmp-restart-vector" '#(-1) (make-kmp-restart-vector "a")) -;;; The following two tests for make-kmp-restart-vector are -;;; intentionally commented (see http://bugs.call-cc.org/ticket/878) -;;; -- mario - -; This seems right to me, but is it? -; (test "make-kmp-restart-vector" '#(-1 0) (make-kmp-restart-vector "ab")) - -; The following is from an example in the code, but I expect it is not right. -; (test "make-kmp-restart-vector" '#(-1 0 0 -1 1 2) (make-kmp-restart-vector "abdabx")) - - +(test "make-kmp-restart-vector" '#(-1 0) (make-kmp-restart-vector "ab")) + +; The following is from an example in the code. It is the "optimised" +; version; it's also valid to return #(-1 0 0 0 1 2), but that will +; needlessly check the "a" twice before giving up. +(test "make-kmp-restart-vector" + '#(-1 0 0 -1 1 2) + (make-kmp-restart-vector "abdabx")) + +;; Each entry in kmp-cases is a pattern, a string to match against and +;; the expected run of the algorithm through the positions in the +;; pattern. So for example 0 1 2 means it looks at position 0 first, +;; then at 1 and then at 2. +;; +;; This is easy to verify in simple cases; If there's a shared +;; substring and matching fails, you try matching again starting at +;; the end of the shared substring, otherwise you rewind. For more +;; complex cases, it's increasingly difficult for humans to verify :) +(define kmp-cases + '(("abc" "xx" #f 0 0) + ("abc" "abc" #t 0 1 2) + ("abcd" "abc" #f 0 1 2) + ("abc" "abcd" #t 0 1 2) + ("abc" "aabc" #t 0 1 1 2) + ("ab" "aa" #f 0 1) + ("ab" "aab" #t 0 1 1) + ("abdabx" "abdbbabda" #f 0 1 2 3 0 0 1 2 3) + ("aabc" "axaabc" #t 0 1 0 1 2 3) + ("aabac" "aabaabac" #t 0 1 2 3 4 2 3 4))) + +(for-each + (lambda (test-case) + (let* ((pat (car test-case)) + (n (string-length pat)) + (str (cadr test-case)) + (match? (caddr test-case)) + (steps (cdddr test-case)) + (rv (make-kmp-restart-vector pat))) + (call-with-input-string + str + (lambda (p) + (let lp ((i 0) + (step 0) + (steps steps)) + (cond + ((or (= i n) (eof-object? (peek-char p))) + (test-assert (sprintf "KMP match? ~S, case: ~S" match? test-case) + (eq? (= i n) match?)) + (test-assert (sprintf "KMP empty remaining steps: ~S, case: ~S" + steps test-case) + (null? steps))) + (else + (let ((new-i (kmp-step pat rv (read-char p) i char=? 0)) + (expected-i (and (not (null? steps)) (car steps)))) + (test (sprintf "KMP step ~S (exp: ~S, act: ~S), case: ~S" + step expected-i i test-case) + expected-i i) + (lp new-i (add1 step) (cdr steps)))))))))) + kmp-cases) ; FIXME! Implement tests for these: ; string-kmp-partial-search -- 1.7.10.4
_______________________________________________ Chicken-hackers mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/chicken-hackers
