Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The sugggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low.
But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams.
Just looking at 3grams...
conts -> con ont nts const -> con ons nst
Thus they just share 1 3gram, thus this is why it scores so low. This is an interesting issue, how to tune the algorithm so that it might return words this close higher.
If you added 2-grams, then it would look like this (constructing also special start/end grams):
Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed index w/ ngrams from length 2 to 5, and also store transpositions, and w/ appropriate boosts :) then "const" is returned 2nd.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
conts -> _c co on nt ts s_ const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4 grams, albeit for a slightly different purpose, so in this case it would be:
conts: _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const: _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.
I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters".
Yes, sounds like a good idea. Even though it increases the size of the lookup index, it still beats using the linear search...
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
