Tate Avery wrote:

I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp

How embarassing! Sorry! Fixed!




T

-----Original Message-----
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:


David Spencer wrote:


I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars.


...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...



Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is each word (er..term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequently words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a spelling correction is:
- frequently occuring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher


[6]

If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback.

thx,
  Dave




--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to