On Feb 7, 10:37 pm, [EMAIL PROTECTED] wrote:
>
> On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>
> > Jeff Schwab wrote:
> > ...
> > > If the strings happen to be the same length, the Levenshtein distance
> > > is equivalent to the Hamming distance.
>
> Is this really what the OP was asking for? If I understand it correctly,
> Levenshtein distance works out the number of edits required to transform
> the string into the target string. The smaller the distance, the more
> equivalent, but with the OP's problem I would expect
>
> table1    table2
> brian     briam
>           erian
>
> I think the OP would like to guess 'briam' rather than 'erian', but
> Levenshtein would rate them as equally good guesses?
>
> I know this is pushing it more toward phonetic analysis of the words or
> something similar, and that's orders of magnitude more complex.
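For concreteness: with a plain, unit-cost Levenshtein distance the two
guesses in the quoted example really do tie. A minimal sketch (mine, not
code from the thread):

def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca from a
                           cur[j - 1] + 1,              # insert cb into a
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("brian", "briam"))   # 1
print(levenshtein("brian", "erian"))   # 1 -- the tie described above

Both candidates come out at distance 1, so plain Levenshtein has no way to
prefer 'briam'. So is fixing that really orders of magnitude more complex?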
Not very. The edit distance idea can be generalised by having variable
penalties for replacement and for insertion/deletion. E.g. n/m gets a low
replacement penalty because the two letters are phonetically very similar
AND adjacent on some keyboards. Google "zobel editex" for some ideas.

For insertion/deletion, a good tweak is to use a low (even zero) penalty
for omitting a doubled letter, e.g. Matthew / Mathew.

Google "febrl" for a Python package for record matching -- the authors have
a recent paper in which they compare various name-matching methods.

HTH,
John

> This message [big snip] has astonishingly large multi-lingual carbuncles
on its rump. Please consider posting from home.

> Ce message et toutes les pieces jointes (ci-apres le [big snip]
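P.S. A minimal sketch of the weighted variant described above. The penalty
values are invented for illustration only -- they are not taken from editex,
febrl, or any published table:

# Substitution pairs that are phonetically close and/or keyboard-adjacent
# get a reduced penalty (the 0.2 is an arbitrary illustrative value).
CHEAP_SUBS = {frozenset("nm"): 0.2}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return CHEAP_SUBS.get(frozenset((a, b)), 1.0)

def indel_cost(s, i):
    # Deleting (or inserting) s[i] is nearly free if it only removes a
    # doubled letter (Matthew / Mathew); otherwise it costs a full unit.
    return 0.1 if i > 0 and s[i] == s[i - 1] else 1.0

def weighted_edit_distance(a, b):
    """Edit distance with variable replacement and insertion/deletion costs."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a, i - 1)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b, j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a, i - 1),             # delete a[i-1]
                d[i][j - 1] + indel_cost(b, j - 1),             # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]), # substitute
            )
    return d[m][n]

print(weighted_edit_distance("brian", "briam"))     # 0.2 (cheap n/m swap)
print(weighted_edit_distance("brian", "erian"))     # 1.0 (b/e is a full unit)
print(weighted_edit_distance("matthew", "mathew"))  # 0.1 (doubled-letter drop)

With weights along these lines, 'briam' now scores better than 'erian'
against 'brian', which is the behaviour the earlier post was hoping for.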