Hi,
MySpell/Hunspell has a heuristic ngram suggestion algorithm for
`very poorly' spelled words. Hunspell has an improved classification
(see test data: http://qa.openoffice.org/issues/show_bug.cgi?id=35725),
but it seems its default number of suggestions is also sometimes too
small/strict.
I will fix this problem with a more flexible number of suggestions.
FWIW: MySpell uses a two-pass ngram approach:
1. use ngram scoring without length penalty to generate the
"closest" set of root words
2. then expand each of those root words with all prefixes and
suffixes added, and rescore with ngram scoring that uses a length
mismatch penalty to create a reasonable set of suggestions.
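For illustration, here is a rough Python sketch of the two passes described above. It is not MySpell's actual code: the scoring function, the form of the length penalty, the `affix_expand` callback and the `max_roots` / `max_suggestions` limits are all assumptions made up for this example.

```python
import heapq


def ngram_score(n, a, b, length_penalty=False):
    """Count the substrings of `a` (sizes 1..n) that also occur in `b`.

    With length_penalty=True, subtract the difference in length
    (an assumed, simplified penalty).
    """
    score = 0
    for size in range(1, n + 1):
        for i in range(len(a) - size + 1):
            if a[i:i + size] in b:
                score += 1
    if length_penalty:
        score -= abs(len(a) - len(b))
    return score


def suggest(word, roots, affix_expand, max_roots=10, max_suggestions=5):
    # Pass 1: keep the closest root words, scored without a length penalty.
    best_roots = heapq.nlargest(
        max_roots, roots, key=lambda r: ngram_score(3, word, r))

    # Pass 2: expand the surviving roots with affixes and rescore,
    # this time penalising length mismatch.
    candidates = sorted({form for r in best_roots for form in affix_expand(r)})
    return heapq.nlargest(
        max_suggestions, candidates,
        key=lambda c: ngram_score(3, word, c, length_penalty=True))


# Hypothetical toy dictionary and affix expansion, for demonstration only.
roots = ["walk", "talk", "zebra", "work"]
expand = lambda r: [r, r + "s", r + "ed", r + "ing"]
result = suggest("wolked", roots, expand, max_roots=2, max_suggestions=2)
```

The two limits are exactly the tuning knobs in question: with `max_roots=1` in this toy example, "work" and all of its forms would never reach pass 2.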
The problems are:
- how many root words should be kept from pass 1? If too small then
good suggestions will never be generated. If too big then ...
- how many suggestions should be kept from pass 2?
- how to throw away ngram suggestions that are simply horrible?
To determine if an ngram suggestion is too horrible, MySpell takes
the given word, creates an intentionally bad word by making 3 or 4
changes, and then ngram-scores that against the original to come up
with a "lower bound" score that all good suggestions should beat.
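A rough Python sketch of such a lower bound follows. The mangling scheme used here (overwriting every fourth character with a placeholder, averaging over three start offsets, then subtracting one) is an assumption chosen for illustration, as are all the function names; treat it as one possible way to do it rather than the exact MySpell procedure.

```python
def ngram_score(n, a, b):
    """Count the substrings of `a` (sizes 1..n) that also occur in `b`."""
    return sum(
        1
        for size in range(1, n + 1)
        for i in range(len(a) - size + 1)
        if a[i:i + size] in b
    )


def mangle(word, start, step=4, placeholder="*"):
    """Return a deliberately damaged copy of `word`."""
    chars = list(word)
    for i in range(start, len(chars), step):
        chars[i] = placeholder
    return "".join(chars)


def lower_bound(word, n=3):
    """Average score of a few mangled copies, minus one (assumed scheme)."""
    scores = [ngram_score(n, word, mangle(word, start)) for start in (1, 2, 3)]
    return sum(scores) // len(scores) - 1


def filter_suggestions(word, candidates, n=3):
    """Keep only candidates that beat the lower bound."""
    bound = lower_bound(word, n)
    return [c for c in candidates if ngram_score(n, word, c) > bound]


kept = filter_suggestions("suggestion", ["suggestion", "zzz"])
```

Because the bound is derived from the misspelled word itself, it scales with word length instead of being a fixed constant.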
I think you will have to experiment with how many ngram suggestions
to keep in the running both in pass 1 and in pass 2.
Possibly sorting them not by score itself but by the longest common
substring might be a usable approach.
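As a rough idea of what sorting by longest common substring could look like, here is a small Python sketch using the standard `difflib` module. The function names are made up for this example, and whether this ordering actually beats ngram scores is untested.

```python
from difflib import SequenceMatcher


def longest_common_substring_len(a, b):
    """Length of the longest contiguous block shared by `a` and `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def rank_by_lcs(word, candidates):
    """Order candidates by longest common substring with `word`, best first."""
    return sorted(
        candidates,
        key=lambda c: longest_common_substring_len(word, c),
        reverse=True,
    )


ranked = rank_by_lcs("wolked", ["worked", "walked", "zebra"])
```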
Either way, it will take some fine tuning and there is no guarantee
that it will work well for all languages. That is why MySpell only
invokes ngram scoring when no suggestions are generated by the
single-edit (edit distance 1), related-character, or
replacement-table approaches.
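In sketch form, that fallback order might look like the Python below. The generator callbacks are hypothetical stand-ins for the single-edit, related-character and replacement-table steps; none of the names come from MySpell itself.

```python
def suggest_with_fallback(word, cheap_generators, ngram_suggest):
    """Try the cheap generators first; use ngram scoring only as a last resort."""
    suggestions = []
    for generate in cheap_generators:
        for candidate in generate(word):
            if candidate not in suggestions:
                suggestions.append(candidate)
    if not suggestions:
        # Nothing found by the cheap methods, so fall back to ngram scoring.
        suggestions = list(ngram_suggest(word))
    return suggestions


# Hypothetical stand-in generators, for demonstration only.
replacement_table = lambda w: ["phone"] if w == "fone" else []
no_hits = lambda w: []
ngram_fallback = lambda w: ["ngram-guess"]

hit = suggest_with_fallback("fone", [no_hits, replacement_table], ngram_fallback)
miss = suggest_with_fallback("xqzv", [no_hits, replacement_table], ngram_fallback)
```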
My 2 cents,
Kevin