Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

David Spencer Tue, 14 Sep 2004 12:28:57 -0700

Tate Avery wrote:

I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp


How embarassing!
Sorry!
Fixed!

T
-----Original Message-----
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.
...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word.
Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections.
http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er..term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequently words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occuring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher
[6]
If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback.
thx,
  Dave
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Reply via email to