Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any term in a query that has zero matches, we go through all the terms in the index (!) and calculate the Levenshtein string distance, returning the best matches. A more intelligent way of doing this would be to instead look only at terms that also match on the first "n" (probably 3) chars.
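[For concreteness, here is a minimal self-contained sketch of that brute-force pass in plain Java. The suggest() helper, the sample term list, and the cutoff of 2 suggestions are made up for illustration - this is not the actual patch code.]

    import java.util.*;

    public class BruteForceSuggester {

        // Classic dynamic-programming Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // The expensive part: walk *every* term in the index, rank by distance.
        static List<String> suggest(String badTerm, Collection<String> allIndexTerms, int max) {
            List<String> ranked = new ArrayList<>(allIndexTerms);
            ranked.sort(Comparator.comparingInt(t -> levenshtein(badTerm, t)));
            return ranked.subList(0, Math.min(max, ranked.size()));
        }

        public static void main(String[] args) {
            List<String> indexTerms = Arrays.asList("recursive", "recurring", "cursive", "resource");
            // "recursive" (distance 1 from the misspelling) ranks first.
            System.out.println(suggest("recursivz", indexTerms, 2));
        }
    }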
...or prepare a fast lookup index in advance - split all existing terms into bi- or trigrams, create a separate lookup index, and then for each term simply issue a phrase query (phrase = all n-grams from the input term) with a slop > 0 to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...
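[A rough sketch of what the indexing side of that could look like, with a plain HashMap standing in for the separate Lucene lookup index. The trigrams() helper and n = 3 are assumptions for illustration; the real proposal would store the n-grams in a Lucene index and query it with a sloppy PhraseQuery.]

    import java.util.*;

    public class TrigramIndexBuilder {

        static final int N = 3; // 3 = trigrams; 2 would give bigrams

        // Split a term into overlapping n-grams:
        // "recursive" -> rec, ecu, cur, urs, rsi, siv, ive
        static List<String> trigrams(String term) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + N <= term.length(); i++) {
                grams.add(term.substring(i, i + N));
            }
            return grams;
        }

        // The "separate lookup index": each n-gram maps to every term containing it.
        static Map<String, Set<String>> buildLookup(Collection<String> allIndexTerms) {
            Map<String, Set<String>> lookup = new HashMap<>();
            for (String term : allIndexTerms) {
                for (String gram : trigrams(term)) {
                    lookup.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
                }
            }
            return lookup;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> lookup =
                    buildLookup(Arrays.asList("recursive", "cursive", "recurring"));
            System.out.println(lookup.get("cur")); // all three terms contain "cur"
        }
    }

[The point is that every term stays reachable through any of its n-grams, so a misspelling that shares most n-grams with an indexed term will still find it.]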
Sounds interesting/fun, but I'm not sure I'm following exactly.
Let's talk through the trigram-index case.
Are you saying that for every trigram in every word there will be a mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:
rec -> recursive
ecu -> recursive
cur -> recursive
urs -> recursive
rsi -> recursive
siv -> recursive
ive -> recursive
And so on for all terms in the orig index.
OK fine.
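[Just to double-check the shape of those entries, this throwaway snippet prints them for "recursive" - n = 3 assumed, as above.]

    import java.util.*;

    public class EntriesDemo {
        public static void main(String[] args) {
            String term = "recursive";
            int n = 3;
            // Prints: rec -> recursive, ecu -> recursive, ... ive -> recursive
            for (int i = 0; i + n <= term.length(); i++) {
                System.out.println(term.substring(i, i + n) + " -> " + term);
            }
        }
    }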
But now the user types in a query like "recursivz".
What's the algorithm? Obviously, I guess: take all trigrams in the bad term and run them through the trigram index - but that will produce lots of suggestions. Now what - use string distance to score them? I guess that makes sense - please confirm that I understand. And so I guess the point here is that we precalculate the trigram -> term mappings to avoid an expensive traversal of all terms in the index, but we still use string distance as a second pass (and probably should force the matches to always match on the first n (3) chars, using the heuristic that people can usually start spelling a word correctly).
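[If that reading is right, the whole two-pass flow would look roughly like this - again plain Java, reusing the trigram and Levenshtein helpers from the sketches above. Counting shared trigrams stands in for the score of a sloppy phrase query against the lookup index, and the startsWith check is the "first n chars" heuristic.]

    import java.util.*;

    public class TwoPassSuggester {

        static final int N = 3;

        static List<String> trigrams(String term) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + N <= term.length(); i++) grams.add(term.substring(i, i + N));
            return grams;
        }

        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        static List<String> suggest(String badTerm, Map<String, Set<String>> lookup, int max) {
            // Pass 1: candidate generation - look up each trigram of the bad term
            // and count how many trigrams each candidate shares with it (a crude
            // stand-in for the sloppy phrase-query score).
            Map<String, Integer> shared = new HashMap<>();
            for (String gram : trigrams(badTerm)) {
                for (String candidate : lookup.getOrDefault(gram, Collections.emptySet())) {
                    shared.merge(candidate, 1, Integer::sum);
                }
            }
            // Pass 2: require a match on the first N chars, then rank the survivors
            // by Levenshtein distance - string distance is only computed for this
            // handful of candidates, never for the whole term dictionary.
            String prefix = badTerm.substring(0, Math.min(N, badTerm.length()));
            List<String> ranked = new ArrayList<>();
            for (String candidate : shared.keySet()) {
                if (candidate.startsWith(prefix)) ranked.add(candidate);
            }
            ranked.sort(Comparator.comparingInt(c -> levenshtein(badTerm, c)));
            return ranked.subList(0, Math.min(max, ranked.size()));
        }

        public static void main(String[] args) {
            List<String> indexTerms = Arrays.asList("recursive", "cursive", "recurring", "resource");
            Map<String, Set<String>> lookup = new HashMap<>();
            for (String term : indexTerms)
                for (String gram : trigrams(term))
                    lookup.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
            // "cursive" shares four trigrams with "recursivz" but is dropped by the
            // prefix heuristic; "recursive" then wins on edit distance.
            System.out.println(suggest("recursivz", lookup, 3)); // [recursive, recurring]
        }
    }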
