>>>>> "ew" == Edi Weitz <[EMAIL PROTECTED]> writes:

  ew> I was aware of Boyer-Moore although I'm not really familiar with
  ew> it. I'll try to implement it this weekend and let the list know
  ew> whether there's a significant difference. (I guess you first
  ew> have to check the size of the strings in order to decide if it
  ew> is worth building the translation tables.)

it's certainly worth it when the string you're searching through is
long; CMUCL's implementation of SEARCH does a naive sequential search. 
Below with a string of length 2M, Boyer-Moore is 100 times faster
than SEARCH (ignoring time spent compiling the match function). 
The win will be biggest with multiple searchs for the same long
pattern. 
  
,---- CMUCL compiled yesterday ---
| USER> (defvar *haystack* (make-string 1000000))
| USER> (dotimes (i (length *haystack*)) (setf (char *haystack* i) (code-char (random 
|128))))
| USER> (progn (setf *haystack* (concatenate 'string *haystack* "The needle" 
|*haystack*)) (values))
| USER> (compile 'make-string-matcher)
| USER> (defvar *matcher* (make-string-matcher "The needle"))
| #<Closure Over Function "lambda (needle)" {48B6C3E1}>
| USER> (time (dotimes (i 10) (search "The needle" *haystack* :test #'char=)))
| Compiling lambda nil: 
| Compiling Top-Level Form: 
| Evaluation took:
|   3.33 seconds of real time
|   3.24 seconds of user run time
|   0.09 seconds of system run time
|   [Run times include 0.13 seconds GC run time]
|   0 page faults and
|   80568264 bytes consed.
| nil
| USER> (time (dotimes (i 10) (funcall *matcher* *haystack*)))
| Compiling lambda nil: 
| Compiling Top-Level Form: 
| Evaluation took:
|   0.03 seconds of real time
|   0.03 seconds of user run time
|   0.0 seconds of system run time
|   0 page faults and
|   0 bytes consed.
| nil
`----

;; Implementation of the Boyer-Moore string search algorithm, for
;; patterns up to length 2048. This is a very fast way of searching
;; for substrings, whose efficiency relative to other algorithms
;; increases as the length of the search pattern increases.
;;
;; Reference: "A Fast String Searching Algorithm", J. Moore and R.
;; Boyer. Communications of the Association for Computing Machinery,
;; 20(10), 1977, pp. 762-772.
(defun make-string-matcher (needle)
  (declare (simple-string needle))
  (if (zerop (length needle))
      (lambda (haystack) (declare (ignore haystack)) 0)

      (lambda (haystack)
        (declare (optimize (speed 3) (safety 0) (space 0) (debug 0))
                 (simple-string haystack))
        (let* ((n (1- (length haystack)))
               (m (1- (length needle)))
               (skip (make-array char-code-limit
                                 :initial-element (1+ m)
                                 :element-type '(integer 0 2048))))
          (labels ((outer (i) (declare (fixnum i)) (and (<= i n) (inner i m i)))
                   (inner (k j i)
                     (declare (fixnum k j i))
                     (if (char= (char haystack k) (char needle j))
                         (if (zerop j) k (inner (1- k) (1- j) i))
                         (outer (+ i (aref skip (char-code (char haystack i))))))))
            ;; initialize the skip table
            (dotimes (j m)
              (setf (aref skip (char-code (char needle j))) (- m j)))
            (outer m))))))
    

  ew> PS: However, the question arises why CMUCL doesn't use this algorithm
  ew> for its implementation of SEARCH.

it could be done with an appropriate DEFTRANSFORM (see below). Some
tuning work needs to be done to decide when it's worth compiling the
match function, and it should also be possible to adapt the
MAKE-STRING-MATCHER function to handle start and end delimiters.


(deftransform search ((string1 string2 &key (start1 0) end1 (start2 0) end2
                               test)
                      (simple-string simple-string &rest t))
  (unless (and (zerop start1)
               (null end1)
               (zerop start2)
               (null end2))
    (give-up))
  '(funcall (make-string-matcher string1) string2))
  
-- 
Eric Marsden                          <URL:http://www.laas.fr/~emarsden/>

Reply via email to