Martin Braun wrote:
Hi all, does anybody have practical experience with LingPipe's spell checker (http://www.alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html)?
I wrote the demo and I am the company 'system tuner' so I can perhaps help out here.
With Lucene's spellcheck contribution I am not really satisfied because the index has some (many?) misspelled words, so the did-you-mean class (from the java.net example) is good at finding similar misspelled words. With the similarWords function the correct word is only around position 2-5, though it should be more frequent in the index.
Not quite sure I understand what the issue is here. Is it that similarWords returns ranked words and the correct one is too far down the ranked list?
So for now I am thinking of switching to LingPipe, but I have a couple of questions: Is it better than Lucene's spell-check contribution?
It is different in that it is intended to model spelling at the character n-gram level and to have phrasal sensitivity. Assuming the similarWords approach is a version of edit distance, LingPipe adds in a score for how well the resulting edits 'fit' the model of the indexed data. So editing 'Martni' to 'Martin' would be accepted if 'Martni' fit the model much worse (a log2 estimate) than the suggestion 'Martin'. In general the best of the possible edits would be accepted.
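To make the 'fit' comparison concrete, here is a minimal sketch in Python. It is not LingPipe code: the class `CharNGramModel`, the function `best_correction`, the add-one smoothing, the fixed `edit_cost` penalty, and the tiny training string are all assumptions chosen for illustration. It only shows the general idea of comparing a query with a candidate edit by their log2 scores under a character n-gram model.

```python
import math
from collections import Counter


class CharNGramModel:
    """Toy character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        self.alphabet = set()

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * (self.n - 1) + text
        self.alphabet.update(padded)
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            self.ngram_counts[context + padded[i]] += 1
            self.context_counts[context] += 1

    def log2_prob(self, text):
        """Sum of log2 P(char | previous n-1 chars) over the whole string."""
        padded = " " * (self.n - 1) + text
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngram_counts[context + padded[i]] + 1
            den = self.context_counts[context] + vocab
            total += math.log2(num / den)
        return total


def best_correction(model, query, candidates, edit_cost=-2.0):
    """Return the best-scoring string among the query and its candidate edits.

    Each candidate pays a fixed log2 penalty (edit_cost, an assumed value),
    so an edit only wins if it fits the model clearly better than the query.
    """
    scored = [(model.log2_prob(query), query)]
    for cand in candidates:
        scored.append((model.log2_prob(cand) + edit_cost, cand))
    return max(scored)[1]


model = CharNGramModel(n=3)
model.train("martin braun wrote to the list and martin asked about spell checking")
print(best_correction(model, "martni", ["martin"]))
```

A more negative `edit_cost` makes the checker more conservative, which is one of the knobs that trades correction rate against false positives.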
What about performance?
Tuning params dominate the performance space. A small beam (16 active hypotheses) will be quite snappy. (I have 200 queries/sec with a beam of 32 over an 80 gig text collection that, with some pruning, was 5 gig in memory running an 8-gram model.)
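As a rough illustration of why beam width matters so much for speed, here is a generic beam search sketch in Python. It is not LingPipe's decoder; `beam_search`, `expand`, and the toy scoring are hypothetical names and values. The point is only that the work per step grows with the number of active hypotheses kept.

```python
import heapq


def beam_search(start, expand, steps, beam_width=16):
    """Generic beam search: keep only the beam_width best hypotheses per step.

    expand(hypothesis) must yield (score_delta, next_hypothesis) pairs,
    where higher scores are better (for example, log2 probabilities).
    """
    beam = [(0.0, start)]
    for _ in range(steps):
        candidates = []
        for score, hyp in beam:
            for delta, nxt in expand(hyp):
                candidates.append((score + delta, nxt))
        if not candidates:
            break
        # Pruning step: everything outside the top beam_width is dropped.
        beam = heapq.nlargest(beam_width, candidates)
    return beam


def toy_expand(hyp):
    # Hypothetical expansion: appending 'a' is cheap, appending 'b' costs more.
    yield (-1.0, hyp + "a")
    yield (-2.0, hyp + "b")


result = beam_search("", toy_expand, steps=3, beam_width=2)
print(result[0])
```

With `beam_width=2` the search scores at most 4 candidates per step here; a wider beam explores more alternatives at proportionally higher cost.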
What about the quality of suggestions?
For one customer we had a 1% false positive rate with a 66% correction rate. We could have gotten a much higher correction rate with an increase in false positives but the customer didn't want that. This is better performance than we thought was possible.
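For anyone wanting to measure the same two numbers on their own data, a minimal Python sketch follows. The definitions are assumptions, since the exact ones used in that evaluation are not given here: the false positive rate is taken as the share of correctly spelled queries that the checker changed, and the correction rate as the share of misspelled queries that the checker changed to the intended form. The function name `spelling_eval` and the sample triples are made up for illustration.

```python
def spelling_eval(examples):
    """Compute (false_positive_rate, correction_rate).

    examples: iterable of (query, system_output, intended_query) triples.
    """
    correct_total = false_pos = 0
    misspelled_total = fixed = 0
    for query, output, intended in examples:
        if query == intended:
            # Query was already right: any change is a false positive.
            correct_total += 1
            if output != query:
                false_pos += 1
        else:
            # Query was misspelled: count it as fixed only on an exact match.
            misspelled_total += 1
            if output == intended:
                fixed += 1
    fp_rate = false_pos / correct_total if correct_total else 0.0
    corr_rate = fixed / misspelled_total if misspelled_total else 0.0
    return fp_rate, corr_rate


# Made-up toy data, not results from any real evaluation.
toy = [
    ("martin", "martin", "martin"),
    ("lucene", "lucerne", "lucene"),
    ("martni", "martin", "martin"),
    ("speling", "speling", "spelling"),
    ("indxe", "index", "index"),
]
fp_rate, corr_rate = spelling_eval(toy)
print(fp_rate, corr_rate)
```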
I have not done further formal evals to report. I may get an excuse to play with the AOL query logs so that should be interesting.
Tuning is a big deal and I need to write a tuning tutorial. I am doing more teaching/training now so that may happen.
breck
Does anybody have a good idea how to find typos in the index? tia, martin
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]