I have been thinking about how to improve spelling suggestions.

The objective is twofold: 1) to get more meaningful suggestions, and 2) to
order them in a more meaningful way.

To achieve this, I think that some changes to a word should be treated
as a smaller "distance" from the original word than others. In Catalan,
for example, these changes could be:

- any upper/lower case conversion
- any addition or removal of diacritical signs
- some frequent confusions or errors: o <--> u, b <--> v, L <--> L·L

Moreover, these suggestions should appear first.

This way, for example, we would get "col·laborin" as a suggestion for the
misspelled word "colaborin" (it is not suggested at all now). Or if I
write "poguem" (which is wrong), I would get "puguem" as the first
suggestion (not the 12th, as it is now), or "plenàries" as the first
suggestion for "plenaries" (now the 6th suggestion).
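
To make the idea concrete, here is a minimal sketch of such a weighted
distance. It is not LanguageTool or Morfologik code: the class name, the
cost values (0.2 and 1.0) and the candidate lists are my own assumptions,
chosen only for illustration. Case, diacritics and L·L are folded away
before comparing, and o/u and b/v substitutions are cheaper than other
edits.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; names and costs are assumptions, not an existing API.
final class WeightedSuggestionRanker {

    static final double CHEAP = 0.2; // hypothetical cost of a "small" change
    static final double FULL = 1.0;  // cost of an ordinary edit

    // Fold away case, diacritics and the Catalan middle dot (U+00B7) in "l.l".
    static String fold(String word) {
        String s = word.toLowerCase(Locale.ROOT).replace("l\u00B7l", "l");
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        return s.replaceAll("\\p{M}", ""); // drop combining marks
    }

    // Frequent confusions (o/u, b/v) are cheaper than other substitutions.
    static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        boolean confusable = (a == 'o' && b == 'u') || (a == 'u' && b == 'o')
                || (a == 'b' && b == 'v') || (a == 'v' && b == 'b');
        return confusable ? CHEAP : FULL;
    }

    // Weighted Levenshtein distance computed on the folded forms.
    static double distance(String typed, String candidate) {
        String a = fold(typed);
        String b = fold(candidate);
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) d[i][0] = i * FULL;
        for (int j = 1; j <= b.length(); j++) d[0][j] = j * FULL;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double sub = d[i - 1][j - 1]
                        + substitutionCost(a.charAt(i - 1), b.charAt(j - 1));
                double del = d[i - 1][j] + FULL;
                double ins = d[i][j - 1] + FULL;
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        double folded = d[a.length()][b.length()];
        // Any candidate that differs from the input pays one small penalty,
        // so an identical word would still rank ahead of a folded match.
        return typed.equals(candidate) ? folded : folded + CHEAP;
    }

    // Return the candidates ordered by increasing weighted distance (stable sort).
    static List<String> rank(String typed, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((String c) -> distance(typed, c)));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up candidate list, not the output of a real dictionary.
        System.out.println(rank("poguem", Arrays.asList("poguer", "puguem")));
    }
}
```

With these costs, a candidate that differs only by case, diacritics or
L·L gets distance 0.2, an o/u or b/v confusion adds another 0.2, and any
other edit adds 1.0, so the "small" changes always sort first.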

I suppose that other languages need a similar approach.
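
As for the replacement table during automaton traversal that Marcin
mentions in the quoted message below, this is roughly how I picture it.
It is only a sketch under my own assumptions: a plain trie stands in for
the real FSA, and the class, its methods and the table entries are
hypothetical, not the Morfologik API. It works on Java chars rather than
bytes, so a multi-byte UTF-8 character is a single step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a trie stands in for the dictionary automaton.
final class ReplacementTrie {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    // Hypothetical replacement table: typed fragment -> fragments to try instead.
    private final Map<String, List<String>> replacements = new HashMap<>();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.isWord = true;
    }

    void addReplacement(String typed, String alternative) {
        replacements.computeIfAbsent(typed, k -> new ArrayList<>()).add(alternative);
    }

    // Dictionary words reachable from the input using at most maxReplacements
    // table entries; no other edits are attempted.
    List<String> suggest(String typed, int maxReplacements) {
        List<String> out = new ArrayList<>();
        walk(root, typed, 0, new StringBuilder(), maxReplacements, out);
        return out;
    }

    private void walk(Node node, String typed, int pos, StringBuilder path,
                      int budget, List<String> out) {
        if (pos == typed.length()) {
            String word = path.toString();
            if (node.isWord && !out.contains(word)) out.add(word);
            return;
        }
        // 1) Follow the typed character unchanged.
        Node same = node.next.get(typed.charAt(pos));
        if (same != null) {
            path.append(typed.charAt(pos));
            walk(same, typed, pos + 1, path, budget, out);
            path.setLength(path.length() - 1);
        }
        if (budget == 0) return;
        // 2) Try every table entry whose key occurs at this position.
        for (Map.Entry<String, List<String>> e : replacements.entrySet()) {
            if (!typed.startsWith(e.getKey(), pos)) continue;
            for (String alt : e.getValue()) {
                Node target = follow(node, alt);
                if (target == null) continue;
                int mark = path.length();
                path.append(alt);
                walk(target, typed, pos + e.getKey().length(), path, budget - 1, out);
                path.setLength(mark);
            }
        }
    }

    // Follow a whole fragment from a node; null if the trie has no such path.
    private static Node follow(Node node, String s) {
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.next.get(s.charAt(i));
        }
        return node;
    }
}
```

For example, after add("puguem") and addReplacement("o", "u"), calling
suggest("poguem", 1) reaches "puguem". Only branches that exist in the
dictionary are followed, so the extra work per typed character is
bounded by the size of the table.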

Regards,
Jaume Ortolà



2013/4/7 Marcin Miłkowski <[email protected]>

> On 2013-04-07 11:07, Jaume Ortolà i Font wrote:
> > Hi,
> >
> > I have made an improvement to the Morfologik speller rule. If few
> > suggestions are found, it tries to get more from the word without
> > diacritics. This is useful in Catalan and, I guess, in other languages.
> >
> > The next step would be the other way around: if few suggestions are
> > found, then try some replacement patterns (adding diacritics in
> > some cases...). These patterns have to be language-dependent, of course.
> > Is it OK to write this in MorfologikSpellerRule.java (with the patterns
> > in the corresponding language rule), Marcin?
> >
> > A further step for improving suggestions could be to use a dictionary of
> > frequencies. With this information the suggestions could be ordered: the
> > more frequent words first.
>
> Actually, this should be done at the level of the automaton search to
> make it much faster. We can start hacking around the simplified code of
> the MorfologikSpeller, but that could slow the whole thing down
> drastically (remember that the hunspell slowdown comes from the
> suggestion part). It's not so hard to add diacritics search, as it was
> already part of fsa_spell, but the easy part about it was that it relied
> on 8-bit encodings, and with UTF-8, we can no longer assume that every
> character is just a byte. I don't remember how I traverse the automaton
> right now, but I believe I started with a simplistic version, planning
> to add some UTF-8 support later on, so maybe it's now easier to implement.
>
> The trick is to use a special replacement table when traversing the
> automaton. This way, it's as speedy as it was before.
>
> But coding this, eh, is not so easy. Right now I'm bogged down with
> another project, and I cannot really sit down and code it...
>
> Regards,
> Marcin
>
>
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>