> Would anyone know of any prior art for detection of "short edit distances"?  
> (Perhaps even already on CPAN?)

As David & Zefram pointed out, Levenshtein is the classic algorithm for this, 
but there are plenty of others; in the SEE ALSO for Text::Levenshtein I’ve 
listed at least some of the ones I know of on CPAN:
        https://metacpan.org/pod/Text::Levenshtein#SEE-ALSO

A better algorithm for this purpose is the Damerau-Levenshtein edit distance.
Classic Levenshtein counts the number of insertions, deletions, and 
substitutions needed to get from one string to the other, so comparing 
"Algorithm::SVM" and "Algorithm::VSM" gives an edit distance of 2.
The Damerau variant also counts a transposition of two adjacent characters as 
a single edit. This results in an edit distance of 1 for the example above, 
which is how my script found it.
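
To make the difference concrete, here is a minimal sketch of both distances in 
Python. It is not the code from Text::Levenshtein or 
Text::Levenshtein::Damerau::XS; the function names are made up for 
illustration, and the Damerau version is the restricted ("optimal string 
alignment") form, which may differ from what a given module computes on some 
inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def damerau_osa(a: str, b: str) -> int:
    """Levenshtein plus adjacent transpositions (restricted/OSA form)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Swapping two adjacent characters counts as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("Algorithm::SVM", "Algorithm::VSM"))  # 2
print(damerau_osa("Algorithm::SVM", "Algorithm::VSM"))  # 1
```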

I used Text::Levenshtein::Damerau::XS, because it’s quicker. That’s how I found 
the examples I gave yesterday.

I’ll tweak my script to not worry about packages in the same distribution (e.g. 
Acme::Flat::GV and Acme::Flat::HV). Then I just need to get a list of new 
packages each day, and I’m just about there :-)
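
The same-distribution tweak could look something like the sketch below. This 
is not my actual script: the package-to-distribution mapping is hand-written 
for illustration (the distribution names are assumptions), the function names 
are hypothetical, and the distance is a restricted Damerau-Levenshtein written 
inline so the example stands alone.

```python
from itertools import combinations


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def near_miss_pairs(dist_of: dict, max_distance: int = 1) -> list:
    """Return package pairs within max_distance, skipping pairs that
    belong to the same distribution."""
    pairs = []
    for p, q in combinations(sorted(dist_of), 2):
        if dist_of[p] == dist_of[q]:
            continue  # same distribution: similar names are expected
        if osa_distance(p, q) <= max_distance:
            pairs.append((p, q))
    return pairs


# Illustrative data only: package name -> distribution name (assumed).
packages = {
    "Acme::Flat::GV": "Acme-Flat",
    "Acme::Flat::HV": "Acme-Flat",
    "Algorithm::SVM": "Algorithm-SVM",
    "Algorithm::VSM": "Algorithm-VSM",
}

print(near_miss_pairs(packages))
```

Comparing every pair is quadratic in the number of packages, which is fine for 
a daily list of new packages checked in small batches but would want some 
pruning against the full index.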

Neil
