On Sun, Feb 27, 2011 at 12:50 PM, Jordi Boggiano <j.boggi...@seld.be> wrote:

> http://volnitsky.com/project/str_search/


The algorithm seems flawed to me, at least in its reference implementation.
There does not seem to be any guarantee that the returned position is really
the *first* occurrence of the needle in the haystack.

It's easy to see with needle being a repetition of the same character:
  SS = 'a' * 1000
  W_size = 4

The hash map build will be clogged, with the value 1 stored in the first
slot for SS. As a consequence, the algorithm will step through the haystack,
trying to confirm a match for the needle at the step position. If a match is
found there, it will discard any previous matches that could be valid at
this position. All those haystack will return the same position 998:
  S = 'bbbb' * 997 + 'a' * 1000 (correct)
  S = 'bbb' * 996 + 'a' * 1001 (incorrect, should return 997)
  S = 'bb' * 995 + 'a' * 1002 (incorrect, should return 996)
  S = 'b' * 994 + 'a' * 1003 (incorrect, should return 995)

The implementation could be fixed (by adding an explicit string matching
when building the hash table, and by storing *all* the occurrences of a
given W in SS), but that will increase the overall cost (both computing and
memory) of the algorithm.

Damien Tournoud

Reply via email to