Hi,

Myspell/Hunspell has a heuristic ngram suggestion algorithm for
`very poorly' spelled words. Hunspell has an improved classification
(see test data: http://qa.openoffice.org/issues/show_bug.cgi?id=35725), but its default suggestion limit also seems to be too small/strict sometimes.
I will fix this problem with a more flexible suggestion number.

FWIW: MySpell uses a two-pass ngram approach:

1. use ngram scoring without length penalty to generate the "closest" set of root words

2. then expand each of those root words with all prefixes and suffixes added, and rescore the expanded forms with an ngram score that includes a length mismatch penalty, to create a reasonable set of suggestions.
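
A rough sketch of the two passes in Python, for illustration only. The scoring function is a simplified stand-in and not MySpell's actual ngram code; the names ngram_score, ngram_score_penalized, two_pass_suggest, the expand callback, the penalty formula and the default limits are all assumptions of this sketch.

```python
def ngram_score(n, s1, s2):
    """Simplified score: count substrings of s1 (lengths 1..n) found in s2."""
    score = 0
    for k in range(1, n + 1):
        for i in range(len(s1) - k + 1):
            if s1[i:i + k] in s2:
                score += 1
    return score


def ngram_score_penalized(n, s1, s2):
    """Same score minus an assumed length-mismatch penalty."""
    return ngram_score(n, s1, s2) - abs(len(s1) - len(s2))


def two_pass_suggest(word, roots, expand, max_roots=10, max_suggestions=5):
    """expand(root) is a hypothetical callback returning the affixed forms."""
    # Pass 1: keep the closest root words, scored without a length penalty.
    best_roots = sorted(
        roots, key=lambda r: ngram_score(3, word, r), reverse=True
    )[:max_roots]
    # Pass 2: expand each kept root and rescore with the length penalty.
    candidates = {form for r in best_roots for form in expand(r)}
    ranked = sorted(
        candidates, key=lambda c: (-ngram_score_penalized(3, word, c), c)
    )
    return ranked[:max_suggestions]
```

The two limits, max_roots and max_suggestions, are exactly the knobs that need experimenting with.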

The problems are:

- how many root words should be kept from pass 1? If too few then good suggestions will never be generated. If too many then ...

- how many suggestions should be kept from pass 2?

- how to throw away ngram suggestions that are simply horrible?

To determine if an ngram suggestion is too horrible, MySpell takes the given word, creates an intentionally bad word by making 3 or 4 changes to it, and then ngram-scores that bad word against the original to come up with a "lower bound" score that all good suggestions should beat.
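
The lower-bound idea could be sketched like this in Python. The mangling scheme (overwriting every fourth character with a filler at a few offsets, then averaging the scores) is an assumption chosen for illustration, as are the names mangle, lower_bound and filter_horrible and the simplified ngram_score.

```python
def ngram_score(n, s1, s2):
    """Simplified score: count substrings of s1 (lengths 1..n) found in s2."""
    score = 0
    for k in range(1, n + 1):
        for i in range(len(s1) - k + 1):
            if s1[i:i + k] in s2:
                score += 1
    return score


def mangle(word, start, step=4, filler="*"):
    """Make an intentionally bad copy by overwriting every step-th character."""
    chars = list(word)
    for i in range(start, len(chars), step):
        chars[i] = filler
    return "".join(chars)


def lower_bound(word, n=3):
    """Average score of a few mangled copies against the original word."""
    scores = [ngram_score(n, word, mangle(word, start)) for start in (1, 2, 3)]
    return sum(scores) // len(scores)


def filter_horrible(word, candidates, n=3):
    """Keep only the candidates that beat the lower-bound score."""
    threshold = lower_bound(word, n)
    return [c for c in candidates if ngram_score(n, word, c) > threshold]
```

How aggressive the mangling is decides how strict the cutoff becomes, so it is another thing to tune per language.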

I think you will have to experiment with how many ngram suggestions to keep in the running, both in pass 1 and in pass 2.

Possibly sorting them not by score itself but by the longest common substring might be a usable approach.
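
One way to try that, as an untested idea rather than anything MySpell does: a small Python sketch that ranks candidates by the length of the longest common substring, taking "substring" literally as a contiguous run. The function names are made up for this example.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous character pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rank_by_lcs(word, candidates):
    """Sort candidates by longest shared run, ties broken alphabetically."""
    return sorted(
        candidates, key=lambda c: (-longest_common_substring_len(word, c), c)
    )
```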

Either way, it will take some fine tuning and there is no guarantee that it will work well for all languages. That is why MySpell only invokes ngram scoring when no suggestions are generated by the single-edit (edit distance 1), related-character, or replacement-table approaches.
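
In outline, that fallback order looks something like the Python sketch below. The strategies are passed in as plain callables; suggest, cheap_strategies and ngram_strategy are hypothetical names for this example, not MySpell's API.

```python
def suggest(word, cheap_strategies, ngram_strategy):
    """Run the cheap strategies first; use ngram scoring only as a last resort."""
    suggestions = []
    for strategy in cheap_strategies:
        for s in strategy(word):
            if s not in suggestions:
                suggestions.append(s)
    if not suggestions:
        # Nothing found by the cheaper methods, so fall back to ngram scoring.
        suggestions = list(ngram_strategy(word))
    return suggestions
```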

My 2 cents,

Kevin




