Bruno P. Kinoshita created TEXT-109:
---------------------------------------
Summary: Implement or document how to use edit distances that
consider the keyboard layout
Key: TEXT-109
URL: https://issues.apache.org/jira/browse/TEXT-109
Project: Commons Text
Issue Type: New Feature
Reporter: Bruno P. Kinoshita
Priority: Minor
Most edit distances take into consideration the number of "changes" required
to turn one string into another, and give you a value that represents the
distance between the two strings.
While that is helpful, datasets and corpora that were typed on keyboards
(e.g. SMS, e-mail, transcripts) commonly contain typing mistakes. In many of
these cases a letter was accidentally mistyped, and the character that was
typed is normally close on the keyboard to the correct character.
For example, take the word "one" and two misspellings, "onr" and "oni". The
Levenshtein distance for both is 1. But if you know that the text was typed on
an English keyboard with the QWERTY layout (notice the E and the R), then the
distance between "one" and "onr" should be smaller than the distance between
"one" and "oni", because on that keyboard the letter 'E' is next to 'R',
whereas 'I' is not even covered by the left hand, but by the right hand.
Here are some reference links for further research.
* https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
* https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
* http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
* https://github.com/wsong/Typo-Distance
* https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
Ideally, such an edit distance would be extensible to support other keyboard
layouts.
There is some indication that an existing edit distance like Levenshtein
could be extended to take the keyboard layout into consideration, so a new
edit distance may not be strictly necessary.
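As a rough illustration of that idea, here is a minimal sketch of a
Levenshtein variant in which substituting a key for one of its neighbours
costs less than any other substitution. Nothing in it is existing Commons Text
API: the class name KeyboardLevenshtein, the neighbourCost parameter and the
row-based layout description are all hypothetical. It also treats the keyboard
as a plain grid and ignores the stagger between rows, so its notion of
"neighbouring" is only an approximation of a physical keyboard.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: Levenshtein distance with a cheaper substitution cost for keys
 * that are adjacent on a keyboard layout. All names are hypothetical.
 */
final class KeyboardLevenshtein {

    // Maps each key to its {row, column} position in the layout grid.
    private final Map<Character, int[]> positions = new HashMap<>();
    private final double neighbourCost;

    /**
     * @param rows          keyboard rows from top to bottom, e.g. "qwertyuiop"
     * @param neighbourCost cost of substituting adjacent keys (assumed < 1.0)
     */
    KeyboardLevenshtein(String[] rows, double neighbourCost) {
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < rows[r].length(); c++) {
                positions.put(rows[r].charAt(c), new int[] {r, c});
            }
        }
        this.neighbourCost = neighbourCost;
    }

    // 0 for equal characters, neighbourCost for adjacent keys, 1 otherwise.
    // Characters that are not part of the layout always cost 1.
    private double substitutionCost(char a, char b) {
        if (a == b) {
            return 0.0;
        }
        int[] p = positions.get(a);
        int[] q = positions.get(b);
        if (p == null || q == null) {
            return 1.0;
        }
        // Simplification: adjacency on a grid, ignoring the row stagger.
        boolean adjacent = Math.abs(p[0] - q[0]) <= 1 && Math.abs(p[1] - q[1]) <= 1;
        return adjacent ? neighbourCost : 1.0;
    }

    /** Standard two-row dynamic programme; insertions and deletions cost 1. */
    double distance(String s, String t) {
        double[] prev = new double[t.length() + 1];
        double[] curr = new double[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                double substitute = prev[j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                double delete = prev[j] + 1.0;
                double insert = curr[j - 1] + 1.0;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }
}
```

Constructed with the rows "qwertyuiop", "asdfghjkl", "zxcvbnm" and a
neighbourCost of 0.5, this gives 0.5 for "one" vs "onr" and 1.0 for "one" vs
"oni". Because the layout is passed in as data, another keyboard layout would
only need a different set of rows. The value 0.5 is arbitrary; a real
implementation would probably need weights derived from statistics.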
We could come to the decision that it is too hard to implement and would be
better done in a spell checker, or that it would require some statistics and
would be out of the scope of Text. Or we could simply add documentation on how
to do it, without adding any code. Or perhaps we add a new edit distance.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)