[ https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364542#comment-16364542 ]
Mark Dacek commented on TEXT-109:
---------------------------------

Hello. I discussed this with [~chtompki] today. We will be attempting this shortly.

> Implement or document how to use edit distances that consider the keyboard layout
> ---------------------------------------------------------------------------------
>
>                 Key: TEXT-109
>                 URL: https://issues.apache.org/jira/browse/TEXT-109
>             Project: Commons Text
>          Issue Type: New Feature
>            Reporter: Bruno P. Kinoshita
>            Priority: Minor
>              Labels: discussion, edit-distance, help-wanted
>
> Most edit distances count the number of "changes" required to turn one
> string into another, and return a value that represents the distance
> between the words.
> While this is helpful, datasets and corpora that have been created with
> keyboards (e.g. SMS, e-mail, transcripts) commonly contain mistakes. In
> some cases a letter was accidentally mistyped, but the character used is
> normally close to the correct character on the keyboard.
> For example, take the word "one" and two misspellings, "onr" and "oni".
> The Levenshtein distance for both would be 1. But if you are aware that
> the keyboard layout is English QWERTY (notice the E and the R), the
> distance between "one" and "onr" would be smaller than the distance
> between "one" and "oni", because on the English keyboard the letter 'E'
> neighbours 'R', whereas 'I' is not even covered by the left hand, but by
> the right hand.
> Here are some reference links for further research.
> * https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
> * https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
> * http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
> * https://github.com/wsong/Typo-Distance
> * https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
> Ideally such an edit distance would be extensible to support other
> keyboard layouts.
> There is some indication that an existing edit distance such as
> Levenshtein could be extended to take the keyboard layout into
> consideration, so a new edit distance may not be entirely necessary.
> We could come to the decision that it is too hard to implement and would
> be better done in a spell checker, or that it would require some
> statistics and would be out of the scope of Text. Or we could simply add
> documentation on how to do it, without adding any code. Or, perhaps we
> add a new edit distance.
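
To make the "extend Levenshtein" option from the description concrete, here is a minimal sketch, not a proposed patch. It is a plain Levenshtein distance where only the substitution cost changes: replacing a character with a neighbouring key costs 0.5 instead of 1. Everything in it is an assumption made for illustration: the class name KeyboardDistanceSketch is hypothetical and is not an existing Commons Text class, the layout covers only the three US QWERTY letter rows, each row is assumed to be shifted right by half a key, and the 0.5 cost and 1.5 key-width neighbour threshold are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Levenshtein distance with a keyboard-aware substitution cost.
final class KeyboardDistanceSketch {

    // Assumed layout: the three letter rows of a US QWERTY keyboard.
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    private static final Map<Character, double[]> POSITIONS = new HashMap<>();

    static {
        for (int row = 0; row < ROWS.length; row++) {
            for (int col = 0; col < ROWS[row].length(); col++) {
                // Assumption: each row is shifted right by half a key width.
                POSITIONS.put(ROWS[row].charAt(col), new double[] {col + 0.5 * row, row});
            }
        }
    }

    // 0.0 for equal characters, 0.5 for neighbouring keys, 1.0 otherwise.
    static double substitutionCost(char a, char b) {
        char x = Character.toLowerCase(a);
        char y = Character.toLowerCase(b);
        if (x == y) {
            return 0.0;
        }
        double[] p = POSITIONS.get(x);
        double[] q = POSITIONS.get(y);
        if (p == null || q == null) {
            return 1.0; // characters outside the modelled layout get the plain cost
        }
        double keyDistance = Math.hypot(p[0] - q[0], p[1] - q[1]);
        return keyDistance < 1.5 ? 0.5 : 1.0;
    }

    // Standard two-row dynamic programming; insertions and deletions cost 1.
    static double distance(CharSequence left, CharSequence right) {
        int n = left.length();
        int m = right.length();
        double[] prev = new double[m + 1];
        double[] curr = new double[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                double substitute = prev[j - 1]
                        + substitutionCost(left.charAt(i - 1), right.charAt(j - 1));
                double delete = prev[j] + 1.0;
                double insert = curr[j - 1] + 1.0;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }

    public static void main(String[] args) {
        System.out.println(distance("one", "onr")); // prints 0.5 (E and R are neighbours)
        System.out.println(distance("one", "oni")); // prints 1.0 (E and I are far apart)
    }
}
```

Supporting other keyboard layouts would mostly mean swapping the ROWS table, which is why the layout is kept separate from the distance computation in this sketch.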