Aad Nales wrote:

By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure what the problem might be:


1. my expectations (most likely ;-)
2. something in the code..


Good question.

If I use the form at the bottom of the page and ask for more results, the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0

Even boosting the prefix match from 2.0 to 10.0 only changes the ranking a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0


To restate the question for a second.

The misspelled word is: "conts".
The expected suggestion is "const", which seems reasonable enough as it's just a transposition away, so the string distance is low.
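
To put a number on "low", here is a minimal sketch of an edit distance computed two ways: plain Levenshtein, and a variant that also counts a swap of two adjacent letters as a single edit (often called optimal string alignment). The class and method names are hypothetical, and nothing here is taken from the speller's actual code, which may measure distance differently.

```java
class EditDistance {
    // Dynamic-programming edit distance between a and b.
    // With allowTransposition set, swapping two adjacent characters
    // costs one edit instead of two (optimal string alignment variant).
    static int distance(String a, String b, boolean allowTransposition) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (allowTransposition && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("conts", "const", false)); // 2
        System.out.println(distance("conts", "const", true));  // 1
    }
}
```

Either way the distance is small for a five-letter word: 2 under plain Levenshtein, 1 when a transposition counts as one edit.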


But - I guess the problem w/ the algorithm is that for short words like this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they share just one 3gram ("con"), which is why it scores so low. This is an interesting issue: how to tune the algorithm so that it ranks words this close higher.
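
The overlap can be checked mechanically. The following is only an illustrative sketch with hypothetical names (`NGramOverlap`, `ngrams`, `shared`); it is not the speller's code and it says nothing about how the index actually scores matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NGramOverlap {
    // Returns every contiguous substring of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Returns the n-grams that occur in both words.
    static Set<String> shared(String a, String b, int n) {
        Set<String> common = new LinkedHashSet<>(ngrams(a, n));
        common.retainAll(ngrams(b, n));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("conts", 3));          // [con, ont, nts]
        System.out.println(ngrams("const", 3));          // [con, ons, nst]
        System.out.println(shared("conts", "const", 3)); // [con]
        System.out.println(shared("conts", "const", 4)); // []
    }
}
```

With 4grams the two words share nothing at all ("cont", "onts" versus "cons", "onst"), so with min=3 and max=4 the prefix "con" is the only thing the two words have in common.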

I guess one way is to add all simple transpositions to the lookup table (the "ngram index") so that these could easily be found, with the heuristic that "a frequent way of misspelling words is to transpose two adjacent letters".
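
As a rough sketch of that idea, the adjacent-letter transpositions of a dictionary word could be generated like this. The class and method names are hypothetical, and how the variants would actually be stored in the ngram index is left open; this only shows that "conts" falls out of "const".

```java
import java.util.LinkedHashSet;
import java.util.Set;

class Transpositions {
    // Returns every distinct string obtained by swapping one pair of
    // adjacent characters in the given word.
    static Set<String> adjacentTranspositions(String word) {
        Set<String> variants = new LinkedHashSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping equal letters gives the same word back
            }
            swap(chars, i);
            variants.add(new String(chars));
            swap(chars, i); // restore the original order before the next swap
        }
        return variants;
    }

    private static void swap(char[] chars, int i) {
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
    }

    public static void main(String[] args) {
        // [ocnst, cnost, cosnt, conts]
        System.out.println(adjacentTranspositions("const"));
    }
}
```

A word of length n yields at most n-1 variants, so adding them grows the lookup table roughly linearly with word length.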

Based on other mails I'll make some additions to the code and will report back if anything of interest changes here.





-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene



Aad Nales wrote:


David,

Perhaps I misunderstand something so please correct me if I do. I used


http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequency in your index and


??? how do you know that? Remember, this is an index of _Java_docs, and "const" is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes this behavior?








