Hi Kevin,

Quoting "Kevin B. Hendricks" <[EMAIL PROTECTED]>:
> Hi,
>
> > Myspell/Hunspell has a heuristic ngram suggestion algorithm for
> > `very poorly' spelled words. Hunspell has an improved classification
> > (see test data: http://qa.openoffice.org/issues/show_bug.cgi?id=35725),
> > but it seems, also has a too small/strict default suggestion number
> > sometimes.
> > I will fix this problem with a more flexible suggestion number.
>
> FWIW: MySpell uses a two-pass ngram approach
>
> 1. use ngram scoring without length penalty to generate the
>    "closest" set of root words
>
> 2. then expand each of those root words with all prefixes and
>    suffixes added and rescore with ngrams scoring that uses a length
>    mismatch penalty to create a reasonable set of suggestions.

I admire your great solution. It is very important progress for
suggestions. Thank you for this.

> The problems are:
>
> - how many root words should be kept from pass 1? If too small then
>   good suggestions will never be generated. If too big then ...
>
> - how many suggestions should be kept from pass 2?
>
> - how to throw away ngram suggestions that are simply horrible.
>
> To determine if an ngram suggestion is too horrible, Myspell takes
> the given word and creates an intentionally bad word by making 3 or 4
> changes and then ngram scoring that against the original to come up
> with a "lower bound" score that all good suggestions should beat.
>
> I think you will have to experiment with how many ngram suggestions
> to keep in the running both in pass 1 and in pass 2
>
> Possibly sorting them not by score itself but by the longest common
> substring might be a usable approach.

Yes, ngram scoring alone is not enough. Hunspell uses a similar approach,
weighting the score with the length of the longest common subsequence. It
works fine. The lower bound was also a big problem: anagrams or
quasi-anagrams of short words, and horrible affixed forms of a good word,
also get good scores.
The next version of Hunspell uses a weighted lower bound instead of
reducing the maximum number of ngram suggestions.

> Either way, it will take some fine tuning and there is no guarantee
> that it will work well for all languages. That is why Myspell only
> invokes ngram scoring when no suggestions are generated by edit
> length 1 changes, or related characters, or replacement tables
> approaches.

I agree with you. (Plus, ngram scoring is time-consuming.)

> My 2 cents,

Many thanks for your explanation.

Laci

> Kevin
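
For anyone following the thread, the two-pass scheme discussed above can be sketched roughly as below. This is only an illustration in Python, not the actual MySpell or Hunspell code. All names (`ngram_score`, `lcs_length`, `lower_bound`, `suggest`, `expand`), the penalty of one point per character of length mismatch, the "replace every fourth character" mangling, and the way the longest-common-subsequence length is simply added to the score are assumptions made for the example.

```python
def ngram_score(n, s1, s2, length_penalty=False):
    """Count substrings of s1 (lengths 1..n) that also occur in s2."""
    score = 0
    for size in range(1, n + 1):
        for i in range(len(s1) - size + 1):
            if s1[i:i + size] in s2:
                score += 1
    if length_penalty:
        # Assumed penalty: one point per character of length mismatch.
        score -= abs(len(s1) - len(s2))
    return score


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lower_bound(word, n=3):
    """Score an intentionally damaged copy of the word against itself.

    Assumption: the damage replaces every fourth character, starting
    at index 1, with '*'.
    """
    mangled = list(word)
    for i in range(1, len(mangled), 4):
        mangled[i] = "*"
    return ngram_score(n, word, "".join(mangled))


def suggest(word, roots, expand, max_roots=10, max_suggestions=5):
    """Two-pass ngram suggestion; expand(root) yields affixed forms."""
    # Pass 1: rank root words by ngram score without a length penalty.
    best_roots = sorted(
        roots, key=lambda r: ngram_score(3, word, r), reverse=True
    )[:max_roots]

    # Pass 2: expand the kept roots and rescore with the length penalty.
    threshold = lower_bound(word)
    candidates = set()
    for root in best_roots:
        candidates.update(expand(root))

    scored = []
    for cand in candidates:
        score = ngram_score(3, word, cand, length_penalty=True)
        if score > threshold:  # drop "simply horrible" candidates
            # Assumed weighting: add the LCS length to the ngram score.
            scored.append((score + lcs_length(word, cand), cand))

    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cand for _, cand in scored[:max_suggestions]]


# Toy usage with a hypothetical four-word dictionary and two suffixes.
roots = ["spell", "speak", "table", "zebra"]
expand = lambda root: [root, root + "ing", root + "s"]
print(suggest("speling", roots, expand, max_roots=2, max_suggestions=3))
```

The two open questions from the thread show up here as the `max_roots` and `max_suggestions` parameters: too small a `max_roots` can drop the right root before it is ever expanded, while a larger value makes pass 2 more expensive.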
