Re: [lingu-dev] HunSpell: disappointment

nemeth Mon, 12 Sep 2005 18:41:41 -0700

Hi Kevin,

Quoting "Kevin B. Hendricks" <[EMAIL PROTECTED]>:


> Hi,
>
> > MySpell/Hunspell has a heuristic ngram suggestion algorithm for
> > `very poorly' spelled words. Hunspell has an improved classification
> > (see test data:
> > http://qa.openoffice.org/issues/show_bug.cgi?id=35725),
> > but it seems its default suggestion limit is sometimes too small
> > or too strict.
> > I will fix this problem with a more flexible suggestion limit.
>
> FWIW: MySpell uses a two-pass ngram approach:
>
> 1. use ngram scoring without a length penalty to generate the
> "closest" set of root words
>
> 2. then expand each of those root words with all prefixes and
> suffixes added, and rescore with ngram scoring that uses a length
> mismatch penalty to create a reasonable set of suggestions.

I admire your solution. It is an important advance in suggestion
quality. Thank you for this.
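
For readers following the thread, the two-pass scheme Kevin describes
can be sketched roughly as below. This is a simplified Python
illustration, not MySpell's actual code: the names ngram_score and
two_pass_suggest, the cut-offs keep_roots and keep_suggestions, and the
exact form of the length penalty are all assumptions made for the
example, and only suffixes are expanded to keep it short.

```python
def ngram_score(n, a, b, length_penalty=False):
    """Count the substrings of `a` of length 1..n that also occur in `b`."""
    score = 0
    for size in range(1, n + 1):
        for i in range(len(a) - size + 1):
            if a[i:i + size] in b:
                score += 1
    if length_penalty:
        # Assumed penalty: subtract the length difference beyond two characters.
        score -= max(0, abs(len(a) - len(b)) - 2)
    return score


def two_pass_suggest(word, roots, suffixes, keep_roots=10, keep_suggestions=5):
    # Pass 1: rank bare root words, ignoring length mismatch.
    ranked = sorted(roots, key=lambda r: -ngram_score(3, word, r))
    best_roots = ranked[:keep_roots]

    # Pass 2: expand the surviving roots with suffixes and rescore,
    # this time penalizing length mismatch.
    candidates = set()
    for root in best_roots:
        candidates.add(root)
        for suffix in suffixes:
            candidates.add(root + suffix)
    rescored = sorted(
        candidates,
        key=lambda c: (-ngram_score(3, word, c, length_penalty=True), c),
    )
    return rescored[:keep_suggestions]


suggestions = two_pass_suggest(
    "walkng", ["walk", "talk", "zebra"], ["ing", "ed", "s"],
    keep_roots=2, keep_suggestions=3,
)
```

The two cut-offs correspond to the open questions raised in the quoted
text that follows: how many roots to keep from pass 1 and how many
suggestions to keep from pass 2.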

>
> The problems are:
>
> - how many root words should be kept from pass 1?  If too small then
> good suggestions will never be generated.  If too big then ...
>
> -  how many suggestions should be kept from pass 2?
>
> - how to throw away ngram suggestions that are simply horrible.
>
> To determine if an ngram suggestion is too horrible, MySpell takes
> the given word, creates an intentionally bad word by making 3 or 4
> changes, and then ngram-scores that against the original to come up
> with a "lower bound" score that all good suggestions should beat.
>
> I think you will have to experiment with how many ngram suggestions
> to keep in the running, both in pass 1 and in pass 2.
>
> Possibly sorting them not by score itself but by the longest common
> substring might be a usable approach.

Yes, ngram scoring alone is not enough. Hunspell uses a similar
approach, weighting the score with the length of the longest common
subsequence. It works fine.
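
As a rough illustration of weighting by the longest common subsequence
(a subsequence may skip characters, unlike the common substring
mentioned in the quoted text), here is a minimal Python sketch. The
helper names, the unigram base score, and the weight of 2 are
assumptions chosen for the example and are not taken from Hunspell's
source.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def unigram_overlap(word, candidate):
    """Stand-in base score: how many characters of word occur in candidate."""
    return sum(1 for ch in word if ch in candidate)


def rerank(word, candidates, lcs_weight=2):
    # Add a weighted LCS term so that character order matters,
    # which a pure character-overlap score cannot see.
    def score(c):
        return unigram_overlap(word, c) + lcs_weight * lcs_length(word, c)
    return sorted(candidates, key=lambda c: -score(c))


ranked = rerank("lenght", ["thengl", "length"])
```

Both candidates contain the same letters as "lenght", so the overlap
score alone ties them; the LCS term is what moves "length" to the front.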

The lower bound was also a big problem: anagrams or quasi-anagrams
of short words, and horrible affixed forms of a good word, also
get good scores. The next version of Hunspell uses a weighted lower
bound instead of reducing the maximum number of ngram suggestions.
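
The basic (unweighted) lower-bound idea from the quoted text can be
sketched as follows. This is only an approximation of the description
above: the choice of '*' as the replacement character, the three
offsets, the averaging, and the function names are assumptions, and the
weighting planned for Hunspell is not shown.

```python
def ngram_score(n, a, b):
    """Count the substrings of `a` of length 1..n that also occur in `b`."""
    score = 0
    for size in range(1, n + 1):
        for i in range(len(a) - size + 1):
            if a[i:i + size] in b:
                score += 1
    return score


def lower_bound(word, n=3):
    """Score the word against deliberately damaged copies of itself."""
    total = 0
    for offset in (1, 2, 3):
        mangled = list(word)
        # Damage every fourth character, starting at `offset`.
        for k in range(offset, len(word), 4):
            mangled[k] = "*"
        total += ngram_score(n, word, "".join(mangled))
    # Average over the three damaged copies, minus one for slack.
    return total // 3 - 1


def filter_suggestions(word, candidates, n=3):
    # Keep only candidates that score strictly above the lower bound.
    bound = lower_bound(word, n)
    return [c for c in candidates if ngram_score(n, word, c) > bound]


kept = filter_suggestions("speling", ["spelling", "zzz"])
```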

>
> Either way, it will take some fine tuning and there is no guarantee
> that it will work well for all languages.  That is why MySpell only
> invokes ngram scoring when no suggestions are generated by the
> edit-distance-1 changes, related-character, or replacement-table
> approaches.

I agree with you. (Plus, ngram scoring is time-consuming.)
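
A minimal Python sketch of that ordering, with the cheap
single-edit check first and the expensive ngram search only as a
fallback. The function names and the ngram_fallback callback are
hypothetical, and the related-character and replacement-table steps are
left out for brevity.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, swap, replacement, or insertion from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, dictionary, ngram_fallback):
    # Cheap pass: dictionary words reachable with a single edit.
    cheap = sorted(edits1(word) & dictionary)
    if cheap:
        return cheap
    # Expensive pass: only run ngram scoring when nothing else matched.
    return ngram_fallback(word, dictionary)


words = {"length", "zebra"}
near = suggest("lenght", words, lambda w, d: ["<ngram>"])
far = suggest("qqqqqq", words, lambda w, d: ["<ngram>"])
```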

>
> My 2 cents,

Many thanks for your explanation.

Laci

>
> Kevin




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
