BTW, none of these are indexed solutions AFAICT, so they are all O(N) over the possible matches at lookup time.
If you need something with better lookup times, the standard solution is to do n-gram (by character) of the inputs, and index those. You can also index substrings. At lookup time you do the same thing to the input string. The number of matches and the size of n compared to the length of the input and candidate strings can give you a lower bound on the edit distance. For example: "the quick brown fox" "the quick brown box" the edit distance is 1 here if we choose an n of 5, the sequences will be "the q", "he qu" "e qui" " quic" "quick" etc. If we match at least one we know the edit distance is at most length(input) (19) - n (5) = 14. Otoh, scanning: "i like cheese" is probably meaningless, since they aren't similar at all. Once you take more matches into account it's not as clear cut, obviously, but hopefully if the database is huge then you at least have a much smaller candidate list, and you can further sort it by a number of things (i.e. prioritize substrings with fewer candidate matches, prioritize candidates which match more substrings, etc), before you compute the actual edit distance. The target edit distance lets you choose a threshold, and the higher it is the longer the substrings can be and the more you can prune from the search space. Another similar but simpler approach is to use the nilsimsa hash and index substrings for collisions. This doesn't require n-grams because the data is always aligned. Here are the nilsimsa hashes for the above strings, and another string with an offset change: "the quick brown fox" 0a31b4be01a0808a29e0ec60e9e258545dc05267700223082a0a2128708b2edb "the quick brown box" 0a31b4bf05a08088a9e0ed60e9c858505dc05266700043182a080128688b2edb "a quick brown box" 181034bf05208088a1a4ed00e9c850105dc04266700063182a080128688b2edb "i like cheese" 1aa468101a210c0408b48130006010140047640443804a028042e01c8000402a Then you can split the hashes up into substrings, for instance: The first and third hashes split up to 32 bit substrings have 3 exact matches: 0a31b4bf 05a08088 a9e0ed60 e9c85850 5dc05266 70004318 2a080128 688b2edb 181034bf 05208088 a1a4ed00 e9c85010 5dc04266 70006318 2a080128 688b2edb Wheras "i like cheese" has no common substrings with either 1aa46810 1a210c04 08b48130 00601014 00476404 43804a02 8042e01c 8000402a A trie would be a good fit for such an index, because even partial matches are meaningful, and the data is more compact. Again, depending on the length of the substrings and the number of exactly matching substrings you have some level of confidence about the overall similarity of the inputs. _______________________________________________ Perl mailing list [email protected] http://mail.perl.org.il/mailman/listinfo/perl
