Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any term in a query that has zero matches, we go through all the terms in the index (!) and calculate the Levenshtein string distance, returning the best matches. A more intelligent way of doing this would be to instead look only at terms that also match on the first "n" (probably 3) chars.
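[For concreteness, here is a minimal self-contained sketch of that brute-force pass in plain Java. The suggest() helper, the sample term list, and the cutoff of 2 suggestions are made up for illustration - this is not the actual patch code.]

    import java.util.*;

    public class BruteForceSuggester {

        // Classic dynamic-programming Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // The expensive part: walk *every* term in the index, rank by distance.
        static List<String> suggest(String badTerm, Collection<String> allIndexTerms, int max) {
            List<String> ranked = new ArrayList<>(allIndexTerms);
            ranked.sort(Comparator.comparingInt(t -> levenshtein(badTerm, t)));
            return ranked.subList(0, Math.min(max, ranked.size()));
        }

        public static void main(String[] args) {
            List<String> indexTerms = Arrays.asList("recursive", "recurring", "cursive", "resource");
            // "recursive" (distance 1 from the misspelling) ranks first.
            System.out.println(suggest("recursivz", indexTerms, 2));
        }
    }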
...or prepare a fast lookup index in advance - split all existing terms into bi- or trigrams, create a separate lookup index, and then for each term simply issue a phrase query (phrase = all n-grams from the input term) with a slop > 0 to get similar existing terms. This should be fast, and you could provide a "did you mean" function too...
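[A rough sketch of what the indexing side of that could look like, with a plain HashMap standing in for the separate Lucene lookup index. The trigrams() helper and n = 3 are assumptions for illustration; the real proposal would store the n-grams in a Lucene index and query it with a sloppy PhraseQuery.]

    import java.util.*;

    public class TrigramIndexBuilder {

        static final int N = 3; // 3 = trigrams; 2 would give bigrams

        // Split a term into overlapping n-grams:
        // "recursive" -> rec, ecu, cur, urs, rsi, siv, ive
        static List<String> trigrams(String term) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + N <= term.length(); i++) {
                grams.add(term.substring(i, i + N));
            }
            return grams;
        }

        // The "separate lookup index": each n-gram maps to every term containing it.
        static Map<String, Set<String>> buildLookup(Collection<String> allIndexTerms) {
            Map<String, Set<String>> lookup = new HashMap<>();
            for (String term : allIndexTerms) {
                for (String gram : trigrams(term)) {
                    lookup.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
                }
            }
            return lookup;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> lookup =
                    buildLookup(Arrays.asList("recursive", "cursive", "recurring"));
            System.out.println(lookup.get("cur")); // all three terms contain "cur"
        }
    }

[The point is that every term stays reachable through any of its n-grams, so a misspelling that shares most n-grams with an indexed term will still find it.]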
Sounds interesting/fun, but I'm not sure I'm following exactly.
Let's talk through the trigram-index case.
Are you saying that for every trigram in every word there will be a mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:
rec -> recursive
ecu -> recursive
cur -> recursive
urs -> recursive
rsi -> recursive
siv -> recursive
ive -> recursive
And so on for all terms in the orig index.
OK fine.
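[Just to double-check the shape of those entries, this throwaway snippet prints them for "recursive" - n = 3 assumed, as above.]

    import java.util.*;

    public class EntriesDemo {
        public static void main(String[] args) {
            String term = "recursive";
            int n = 3;
            // Prints: rec -> recursive, ecu -> recursive, ... ive -> recursive
            for (int i = 0; i + n <= term.length(); i++) {
                System.out.println(term.substring(i, i + n) + " -> " + term);
            }
        }
    }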
But now the user types in a query like "recursivz".
What's the algorithm? Obviously, I guess: take all trigrams in the bad term and run them through the trigram index - but that will produce lots of suggestions. Now what - use string distance to score them? I guess that makes sense - please confirm that I understand. And so I guess the point here is that we precalculate the trigram -> term mappings to avoid an expensive traversal of all terms in the index, but we still use string distance as a second pass (and probably should force the matches to always match on the first n (3) chars, using the heuristic that people can usually start spelling a word correctly).
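[If that reading is right, the whole two-pass flow would look roughly like this - again plain Java, reusing the trigram and Levenshtein helpers from the sketches above. Counting shared trigrams stands in for the score of a sloppy phrase query against the lookup index, and the startsWith check is the "first n chars" heuristic.]

    import java.util.*;

    public class TwoPassSuggester {

        static final int N = 3;

        static List<String> trigrams(String term) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + N <= term.length(); i++) grams.add(term.substring(i, i + N));
            return grams;
        }

        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        static List<String> suggest(String badTerm, Map<String, Set<String>> lookup, int max) {
            // Pass 1: candidate generation - look up each trigram of the bad term
            // and count how many trigrams each candidate shares with it (a crude
            // stand-in for the sloppy phrase-query score).
            Map<String, Integer> shared = new HashMap<>();
            for (String gram : trigrams(badTerm)) {
                for (String candidate : lookup.getOrDefault(gram, Collections.emptySet())) {
                    shared.merge(candidate, 1, Integer::sum);
                }
            }
            // Pass 2: require a match on the first N chars, then rank the survivors
            // by Levenshtein distance - string distance is only computed for this
            // handful of candidates, never for the whole term dictionary.
            String prefix = badTerm.substring(0, Math.min(N, badTerm.length()));
            List<String> ranked = new ArrayList<>();
            for (String candidate : shared.keySet()) {
                if (candidate.startsWith(prefix)) ranked.add(candidate);
            }
            ranked.sort(Comparator.comparingInt(c -> levenshtein(badTerm, c)));
            return ranked.subList(0, Math.min(max, ranked.size()));
        }

        public static void main(String[] args) {
            List<String> indexTerms = Arrays.asList("recursive", "cursive", "recurring", "resource");
            Map<String, Set<String>> lookup = new HashMap<>();
            for (String term : indexTerms)
                for (String gram : trigrams(term))
                    lookup.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
            // "cursive" shares four trigrams with "recursivz" but is dropped by the
            // prefix heuristic; "recursive" then wins on edit distance.
            System.out.println(suggest("recursivz", lookup, 3)); // [recursive, recurring]
        }
    }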
