On Sun, Feb 27, 2011 at 12:50 PM, Jordi Boggiano <j.boggi...@seld.be> wrote:
> http://volnitsky.com/project/str_search/ The algorithm seems flawed to me, at least in its reference implementation. There does not seem to be any guarantee that the returned position is really the *first* occurrence of the needle in the haystack. It's easy to see with needle being a repetition of the same character: SS = 'a' * 1000 W_size = 4 The hash map build will be clogged, with the value 1 stored in the first slot for SS. As a consequence, the algorithm will step through the haystack, trying to confirm a match for the needle at the step position. If a match is found there, it will discard any previous matches that could be valid at this position. All those haystack will return the same position 998: S = 'bbbb' * 997 + 'a' * 1000 (correct) S = 'bbb' * 996 + 'a' * 1001 (incorrect, should return 997) S = 'bb' * 995 + 'a' * 1002 (incorrect, should return 996) S = 'b' * 994 + 'a' * 1003 (incorrect, should return 995) The implementation could be fixed (by adding an explicit string matching when building the hash table, and by storing *all* the occurrences of a given W in SS), but that will increase the overall cost (both computing and memory) of the algorithm. Damien Tournoud