Re: [lingu-dev] HunSpell: disappointment

Kevin B. Hendricks Mon, 12 Sep 2005 09:22:12 -0700

Hi,

Myspell/Hunspell has a heuristic ngram suggestion algorithm for
`very poorly' spelled words. Hunspell has an improved classification
(see test data: http://qa.openoffice.org/issues/show_bug.cgi?id=35725),
but it seems that its default suggestion number is also sometimes too
small/strict.
I will fix this problem with a more flexible suggestion number.


FWIW: MySpell uses a two-pass ngram approach:

1. use ngram scoring without a length penalty to generate the
   "closest" set of root words

2. then expand each of those root words with all prefixes and
   suffixes added, and rescore with ngram scoring that uses a length
   mismatch penalty to create a reasonable set of suggestions.
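
The two passes above can be sketched roughly as follows. This is a
minimal illustration in Python, not MySpell's actual code: the names
ngram_score, length_penalty, two_pass_suggest, max_roots and
max_suggestions are hypothetical, the affix expansion is faked with a
plain dict from root to expanded forms, and the exact scoring and
penalty formulas are assumptions.

```python
def ngram_score(word, candidate, max_n=3):
    """Count the n-grams of `word` (sizes 1..max_n) that occur in `candidate`."""
    score = 0
    for size in range(1, max_n + 1):
        for i in range(len(word) - size + 1):
            if word[i:i + size] in candidate:
                score += 1
    return score


def length_penalty(word, candidate, slack=2):
    """Assumed penalty: length difference beyond a small slack."""
    return max(abs(len(word) - len(candidate)) - slack, 0)


def two_pass_suggest(word, forms_by_root, max_roots=2, max_suggestions=3):
    """`forms_by_root` maps each root to its affix-expanded forms (hypothetical)."""
    # Pass 1: rank root words by raw ngram score, no length penalty.
    roots = sorted(forms_by_root,
                   key=lambda root: ngram_score(word, root),
                   reverse=True)[:max_roots]

    # Pass 2: expand the surviving roots and rescore with a length penalty.
    scored = []
    for root in roots:
        for form in forms_by_root[root]:
            score = ngram_score(word, form) - length_penalty(word, form)
            scored.append((score, form))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [form for _, form in scored[:max_suggestions]]
```

The two cut-offs, max_roots and max_suggestions, are exactly the knobs
the questions below are about.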


The problems are:

- how many root words should be kept from pass 1? If too small then
  good suggestions will never be generated. If too big then ...

- how many suggestions should be kept from pass 2?

- how to throw away ngram suggestions that are simply horrible.

To determine if an ngram suggestion is too horrible, Myspell takes
the given word, creates an intentionally bad word from it by making 3
or 4 changes, and then ngram-scores that bad word against the original
to come up with a "lower bound" score that all good suggestions should
beat.
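
As a rough illustration of that "lower bound" idea, here is a small
Python sketch. It is only a sketch under assumptions: the function
names (mangle, lower_bound, drop_horrible), the choice of overwriting
evenly spaced characters with "*", and the scoring function are all
hypothetical and not taken from the MySpell source.

```python
def ngram_score(word, candidate, max_n=3):
    """Count the n-grams of `word` (sizes 1..max_n) that occur in `candidate`."""
    score = 0
    for size in range(1, max_n + 1):
        for i in range(len(word) - size + 1):
            if word[i:i + size] in candidate:
                score += 1
    return score


def mangle(word, changes=3, filler="*"):
    """Build an intentionally bad word by overwriting evenly spaced characters."""
    chars = list(word)
    n = len(chars)
    for k in range(1, changes + 1):
        pos = k * n // (changes + 1)
        if pos < n:
            chars[pos] = filler
    return "".join(chars)


def lower_bound(word, changes=3):
    """Score of the mangled word against the original: the bar to beat."""
    return ngram_score(word, mangle(word, changes))


def drop_horrible(word, candidates, changes=3):
    """Keep only candidates that score strictly above the lower bound."""
    threshold = lower_bound(word, changes)
    return [c for c in candidates if ngram_score(word, c) > threshold]
```

The nice property is that the threshold scales with the word itself, so
no fixed per-language cut-off has to be chosen.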

I think you will have to experiment with how many ngram suggestions
to keep in the running, both in pass 1 and in pass 2.

Possibly sorting them not by the score itself but by the longest
common substring might be a usable approach.
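
A minimal Python sketch of that alternative ordering, assuming
"longest common substring" means the longest contiguous run of
characters shared by the misspelled word and the candidate. The
function names are hypothetical and this is untested against any real
dictionary.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rank_by_lcs(word, candidates):
    """Order candidates by longest common substring, longest first.

    The sort is stable, so candidates with equal length keep their
    incoming order (for example, their ngram-score order).
    """
    return sorted(candidates,
                  key=lambda c: -longest_common_substring(word, c))
```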

Either way, it will take some fine tuning and there is no guarantee
that it will work well for all languages. That is why Myspell only
invokes ngram scoring when no suggestions are generated by the
edit-distance-1 changes, related-character, or replacement-table
approaches.
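
That ordering (cheap strategies first, ngram scoring only as a last
resort) could look something like the Python sketch below. All names
are hypothetical, the related-character step is left out for brevity,
and the ngram stage is passed in as a callable rather than implemented
here.

```python
import string


def edit1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, swap, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)


def apply_replacements(word, table):
    """Apply each (wrong, right) pair of a replacement table at every position."""
    results = set()
    for wrong, right in table:
        start = word.find(wrong)
        while start != -1:
            results.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return results


def suggest(word, dictionary, table, ngram_fallback):
    """Try the cheap strategies first; fall back to ngrams only if they fail."""
    cheap = (edit1(word) | apply_replacements(word, table)) & dictionary
    if cheap:
        return sorted(cheap)
    return ngram_fallback(word)
```

Here `dictionary` is a set of known words and `table` is a list of
(wrong, right) string pairs; both are stand-ins for whatever the real
dictionary and affix files provide.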


My 2 cents,

Kevin




