Hi,

I have just implemented an approach similar to your steps 2 and 3 (with
some simplifications, which I point out below). The replacements are
specified in the .info file, for example:

fsa.dict.speller.equivalent-chars=x \u017a, l \u0142, u \u00f3, \u00f3 u
fsa.dict.speller.replacement-pairs=rz \u017c, \u017c rz, ch h, h ch

(Note the Unicode escapes, which Java property files require.) The code
also handles equivalent characters (the "equivalent-chars" property),
which is similar to hunspell's MAP feature.
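As an illustration of the property format above, here is a minimal
sketch of how such a value could be read into a map. This is not the
code from Speller.java: the class and method names are hypothetical,
and it assumes that pairs are separated by commas and that the two
halves of a pair are separated by whitespace, as in the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, for illustration only.
final class ReplacementPairsSketch {

    /** Parses "a b, c d" into {a=[b], c=[d]}; one key may have several targets. */
    static Map<String, List<String>> parsePairs(String value) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return pairs;
        }
        for (String entry : value.split(",")) {
            String[] parts = entry.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'from to', got: " + entry);
            }
            pairs.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return pairs;
    }

    /** Loads property-file text and parses one key; Properties decodes the \\uXXXX escapes. */
    static Map<String, List<String>> load(String infoFileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(infoFileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parsePairs(props.getProperty(key));
    }
}
```

Calling load(...) with the replacement-pairs line above and the key
"fsa.dict.speller.replacement-pairs" would give a map in which "rz"
maps to the decoded character U+017C, and so on for the other pairs.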

The new code is on GitHub:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java

Two known limitations:

(1) I don't yet check whether the word produced by a replacement is
already a correct one. I should; it is a small fix to add around line
284 (add the word to the candidates if !isMisspelled).

(2) The getAllReplacements method is simplified right now (see the TODO
at line 541). I have not found a clean way to generate all combinations
of possible replacements in a given string when the replaced substring
occurs more than once. If you see a nice solution, please let me know.
I seem to be stuck on this trivial thing.
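One possible way to enumerate the combinations, as a sketch only: walk
the string left to right and, at each occurrence of the substring,
branch into "replace it" and "keep it". The class and method names
below are hypothetical, the sketch handles a single from/to pair, and
it is not the getAllReplacements code from Speller.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class AllReplacementsSketch {

    /** Returns every variant of word in which at least one occurrence of from is replaced by to. */
    static List<String> allReplacements(String word, String from, String to) {
        Set<String> out = new LinkedHashSet<>(); // a set, since different choices can yield the same string
        if (!from.isEmpty()) {
            expand(word, 0, from, to, new StringBuilder(), false, out);
        }
        return new ArrayList<>(out);
    }

    private static void expand(String word, int start, String from, String to,
                               StringBuilder prefix, boolean changed, Set<String> out) {
        int idx = word.indexOf(from, start);
        if (idx < 0) {
            // No further occurrence: emit the variant only if something was replaced.
            if (changed) {
                out.add(prefix + word.substring(start));
            }
            return;
        }
        int mark = prefix.length();

        // Branch 1: replace this occurrence and continue after it.
        prefix.append(word, start, idx).append(to);
        expand(word, idx + from.length(), from, to, prefix, true, out);
        prefix.setLength(mark);

        // Branch 2: keep this occurrence; advance by one character so that
        // an overlapping occurrence starting inside it is still considered.
        prefix.append(word, start, idx + 1);
        expand(word, idx + 1, from, to, prefix, changed, out);
        prefix.setLength(mark);
    }
}
```

For example, allReplacements("chch", "ch", "h") yields "hh", "hch" and
"chh". The number of variants grows as 2^k for k occurrences, so real
code would probably want a cap on k or on the number of results.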

I also implemented new properties to ignore all uppercase words and to 
ignore CamelCase words.
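To make the two checks concrete, here is a small sketch of how such
words might be detected. The names are hypothetical, and the CamelCase
definition used here (a lowercase letter immediately followed by an
uppercase one) is an assumption; the actual checks in Speller.java may
differ.

```java
// Hypothetical helper, for illustration only.
final class CaseFiltersSketch {

    /** True if the word has at least one letter and every letter is uppercase. */
    static boolean isAllUppercase(String word) {
        boolean sawLetter = false;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (Character.isLetter(cp)) {
                if (!Character.isUpperCase(cp)) {
                    return false;
                }
                sawLetter = true;
            }
            i += Character.charCount(cp);
        }
        return sawLetter;
    }

    /** True if a lowercase letter is immediately followed by an uppercase letter. */
    static boolean isCamelCase(String word) {
        int prev = -1;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (prev >= 0 && Character.isLowerCase(prev) && Character.isUpperCase(cp)) {
                return true;
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

With these definitions "NASA" counts as all uppercase, "CamelCase" as
CamelCase, and an ordinary capitalized word such as "Camel" as neither.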

Now the only remaining large feature is better handling of run-on
words: I don't know whether banned suffixes or prefixes should go into
the property file or directly into the dictionary file.

Once we have this, the only remaining issue will be the conversion of
hunspell two-level affix dictionaries.

Regards,
Marcin

On 2013-04-25 11:19, Jaume Ortolà i Font wrote:
> As predicted, the code I wrote for multiple character substitutions had
> several bugs. I solved them (see the attachment), but more problems
> could arise with other languages or other substitutions.
>
> Here I would like to talk about another approach for generating spelling
> suggestions: just checking the words with substitutions directly.
> Several steps could be done, but each step is taken only if no
> suggestions have been found in the previous one. These could be the steps:
>
> 1) Make a tree search.
> 2) Prepare words with substitutions. Are they misspelled words?
> 3) Make a new tree search of words with substitutions.
>
> Note that step 2) is very low cost, and step 3) is high cost. Step 2)
> could even be the first step.
>
> Would this approach be more or less efficient? It depends on the kind
> and the number of errors we find in the texts. When the only errors
> are one or more multiple-character substitutions, it will be faster.
> When there is a multiple-character substitution plus another kind of
> error, it will be slower. So the only way to decide which is better
> is to try both and compare them statistically.
>
> Note that using multiple character substitution inside the tree search
> algorithm is not as costly as repeating the tree search, but it is
> something in between.
>
> Best regards,
> Jaume
>
>
>
> ------------------------------------------------------------------------------
> Try New Relic Now & We'll Send You this Cool Shirt
> New Relic is the only SaaS-based application performance monitoring service
> that delivers powerful full stack analytics. Optimize and monitor your
> browser, app, & servers with just a few lines of code. Try New Relic
> and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>

