Re: More on spelling suggestions

Jaume Ortolà i Font Sun, 28 Apr 2013 14:37:47 -0700

Hi Marcin,

The condition


    && !(containsSeparators && candidate[depth] == (char) dictionaryMetadata.separator)

at line 345 should go at line 339 instead, as we don't want to add
candidates that contain a separator character.



As for getAllReplacements, I think it could be recursive with an index
(starting with fromIndex = 0). Something like this (I haven't tried it):


        List<String> getAllReplacements(final String str, final String src,
                final String rep, final int fromIndex) {
            List<String> replaced = new ArrayList<String>();
            if (src.isEmpty()) {
                return replaced; // nothing to replace; also avoids an endless loop
            }
            // Treat each occurrence at or after fromIndex as the first one replaced.
            // e.g., "abcdabxyzab", src = ab, rep = eg =>
            // "egcdabxyzab", "egcdegxyzab", "egcdegxyzeg", "egcdabxyzeg",
            // "abcdegxyzab", "abcdegxyzeg", "abcdabxyzeg"
            int index = str.indexOf(src, fromIndex);
            while (index != -1) {
                String s = str.substring(0, index) + rep
                        + str.substring(index + src.length());
                replaced.add(s);
                // Combine with every choice of later occurrences; resume after
                // the inserted rep so that it is never matched again.
                replaced.addAll(getAllReplacements(s, src, rep, index + rep.length()));
                index = str.indexOf(src, index + 1);
            }
            return replaced;
        }
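
For comparison, here is a non-recursive sketch of the same enumeration: collect the occurrences of src once, then build one variant per non-empty subset of them, using the bits of a counter to decide which occurrences get replaced. The class name ReplacementCombos and the method allReplacements are made up for illustration and are not part of Speller.java. The sketch assumes occurrences are matched left to right without overlap, and it refuses inputs with more than 20 occurrences, since the number of variants doubles with each one.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplacementCombos {

    // Returns every string obtained by replacing a non-empty subset of the
    // non-overlapping occurrences of src in str with rep.
    static List<String> allReplacements(String str, String src, String rep) {
        List<String> result = new ArrayList<String>();
        if (src.isEmpty()) {
            return result;
        }
        // Collect the start offsets of the occurrences, scanning left to right.
        List<Integer> starts = new ArrayList<Integer>();
        for (int i = str.indexOf(src); i != -1; i = str.indexOf(src, i + src.length())) {
            starts.add(i);
        }
        if (starts.size() > 20) {
            throw new IllegalArgumentException("too many occurrences: " + starts.size());
        }
        // Bit k of mask decides whether occurrence k is replaced.
        for (int mask = 1; mask < (1 << starts.size()); mask++) {
            StringBuilder sb = new StringBuilder();
            int copiedUpTo = 0;
            for (int k = 0; k < starts.size(); k++) {
                int start = starts.get(k);
                sb.append(str, copiedUpTo, start); // text between occurrences
                sb.append((mask & (1 << k)) != 0 ? rep : src);
                copiedUpTo = start + src.length();
            }
            sb.append(str, copiedUpTo, str.length()); // tail after the last occurrence
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : allReplacements("abcdabxyzab", "ab", "eg")) {
            System.out.println(s);
        }
    }
}
```

For "abcdabxyzab" with src = ab and rep = eg this yields seven variants, including "abcdegxyzab" and "egcdabxyzeg". Because the occurrences are located in the original string only, a rep that itself contains src (such as h -> ch) is never matched again.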




Regards,
Jaume


2013/4/28 Marcin Miłkowski <[email protected]>

> Hi,
>
> I have just implemented an approach similar to steps 2 and 3 (with some
> simplifications, as I will point out later). The replacements are
> specified in the .info file, for example:
>
> fsa.dict.speller.equivalent-chars=x \u017a, l \u0142, u \u00f3, \u00f3 u
> fsa.dict.speller.replacement-pairs=rz \u017c, \u017c rz, ch h, h ch
>
> (Note the Unicode escapes, which Java property files require.) It
> also deals with equivalent characters ("equivalent chars"), which is
> similar to hunspell's MAP feature.
>
> The new code is at github:
>
>
> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java
>
> Two known limitations:
>
> (1) I don't check if the replaced word is already a correct one. I
> should, and this is a small fix to be added around line 284 (add to
> candidates if !isMisspelled).
>
> (2) The getAllReplacements method is simplified right now (see TODO at
> line 541). I cannot find a clean way to find all combinations of
> possible replacements in a given string, if there are multiple instances
> of the replaced substring. If you see a nice solution, please let me
> know. I seem to be stuck on this trivial thing.
>
> I also implemented new properties to ignore all uppercase words and to
> ignore CamelCase words.
>
> Now the only remaining large feature is better handling of run-on
> words: I don't know if banned suffixes or prefixes should go to the
> property file or to the dictionary file directly.
>
> After we have this, only the conversion of hunspell two-level affix
> dictionaries will be the issue.
>
> Regards,
> Marcin
>
> On 2013-04-25 11:19, Jaume Ortolà i Font wrote:
> > As predicted, the code I wrote for multiple character substitutions had
> > several bugs. I solved them (see the attachment), but more problems
> > could arise with other languages or other substitutions.
> >
> > Here I would like to talk about another approach for generating spelling
> > suggestions: just checking the words with substitutions directly.
> > Several steps could be done, but each step is taken only if no
> > suggestions have been found in the previous one. These could be the
> > steps:
> >
> > 1) Make a tree search.
> > 2) Prepare words with substitutions. Are they misspelled words?
> > 3) Make a new tree search of words with substitutions.
> >
> > Note that step 2) is very low cost, and step 3) is high cost. Step 2)
> > could even be the first step.
> >
> > Would this approach be more or less efficient? It depends on the kind
> > and number of errors we find in the texts. When a word contains only
> > multiple-character substitution errors (one or more), it will be faster.
> > When a multiple-character substitution error is combined with another
> > kind of error, it will be slower. So the only way to decide which
> > is better is to try both and compare them statistically.
> >
> > Note that using multiple-character substitution inside the tree search
> > algorithm is not as costly as repeating the tree search, but it is
> > something in between.
> >
> > Best regards,
> > Jaume
> >
> > _______________________________________________
> > Languagetool-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/languagetool-devel
> >
