Doug Cutting wrote:

David Spencer wrote:

[1] The user enters a query like:
    recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize". Thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that if the user enters a term that appears in the index, and is thus sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings. So we assume that rare terms in the query might be misspelled (i.e. not what the user intended), and we suggest alternatives to these words too, in addition to the words in the query that are not in the index at all.


Almost.

If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser".
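The rule above can be sketched in a few lines. This is only an illustration, not code from the thread: the function name `should_suggest` and the document counts (out of a hypothetical 1000-document collection) are made up for the example.

```python
def should_suggest(term_doc_freq: int, candidate_doc_freq: int) -> bool:
    """Show a candidate only if it occurs in more documents than the typed term."""
    return candidate_doc_freq > term_doc_freq


# Hypothetical document frequencies in a 1000-document collection.
doc_freq = {"a": 600, "purser": 10, "parser": 50, "recursive": 40}

# "recursize" is in zero documents, so any indexed candidate beats it.
print(should_suggest(doc_freq.get("recursize", 0), doc_freq["recursive"]))

# "purser" is in 1% of documents and "parser" in 5%, so "parser" is worth showing.
print(should_suggest(doc_freq["purser"], doc_freq["parser"]))

# With the frequencies swapped, "parser" is not worth suggesting.
print(should_suggest(50, 10))
```

A term missing from the index gets a frequency of zero, so the same comparison covers both the "not in the index" case and the "rare but present" case.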

If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent". But that gets expensive.

I updated the code to have an optional popularity filter: if true, it only returns matches that are more popular (frequent) than the word that is passed in for spelling correction.
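As a rough sketch of what such a filter might look like (this is not the posted code; `filter_suggestions`, `only_more_popular`, and the frequencies are hypothetical names and numbers chosen for illustration):

```python
def filter_suggestions(word, candidates, doc_freq, only_more_popular=True):
    """Optionally drop candidates that are not more frequent than the input word.

    doc_freq maps a term to the number of documents containing it;
    terms missing from the mapping count as zero.
    """
    if not only_more_popular:
        return list(candidates)
    word_freq = doc_freq.get(word, 0)
    return [c for c in candidates if doc_freq.get(c, 0) > word_freq]


# Hypothetical frequencies: "remove" is already the most common of the three.
doc_freq = {"remove": 500, "removed": 200, "remote": 120}
candidates = ["removed", "remote"]

print(filter_suggestions("remove", candidates, doc_freq))         # filter on
print(filter_suggestions("remove", candidates, doc_freq, False))  # filter off
```

With the filter on, a common word such as "remove" yields no suggestions; with it off, every candidate produced by the matcher is passed through.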


If true (default) then for common words like "remove", no results are returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD: I need to update the javadoc & repost the code, I guess. Also, as per my earlier post, I store simple transpositions for words in the ngram-index.
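The post does not say how the transpositions are generated or stored. One plausible reading, sketched here purely as an assumption, is that for each indexed word every swap of two adjacent letters is stored as well, so that a typo like "fomr" can be matched back to "form". The helper name `adjacent_transpositions` is hypothetical.

```python
def adjacent_transpositions(word):
    """Return every variant of word formed by swapping one pair of adjacent letters."""
    variants = []
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:  # swapping identical letters changes nothing
            variants.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return variants


print(adjacent_transpositions("form"))
```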

-- Dave


Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


