Re: More on spelling suggestions

Marcin Miłkowski Mon, 29 Apr 2013 01:11:27 -0700

On 2013-04-28 23:37, Jaume Ortolà i Font wrote:
> Hi Marcin,
>
> The condition
>
>     && !(containsSeparators &&
>         candidate[depth] == (char) dictionaryMetadata.separator)
>
> at line 345 should go at line 339 instead, as we don't want to "add
> candidates" containing a separator character.


We don't want to add such candidates, but a candidate is only added when 
its last character is on a terminal arc or when it immediately precedes 
the separator. So a candidate containing a separator cannot be added 
here at all: processing has already stopped at the separator on line 345.

That is why adding any further condition doesn't make sense to me.

>
>
>
> As for getAllReplacements, I think it could be recursive with an index
> (starting fromIndex=0). Something like this (I haven't tried it):

Could you try it? As I understand your code, it would only end up 
replacing all instances, but we want more than that. Let me use a 
trivial example:

AbAcAdAe

Say we want to replace "A" with "123". Then we want to have:

123bAcAdAe
Ab123cAdAe
AbAc123dAe
AbAcAd123e
123b123cAdAe
123bAc123d123e
Ab123cAd123e

etc.

In other words, we have 4 possible slots for "123" (one per match of "A"), 
and we should cover the cases where:

- exactly one slot is filled, at any position
- exactly two slots are filled, at any positions
- exactly three slots are filled, at any positions
- all four slots are filled

So we need something like a Cartesian product: each slot is independently 
either replaced or left alone. Something like this: 
http://stackoverflow.com/questions/14841652/string-replacement-combinations
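
To make the slot idea concrete, here is a minimal sketch in Java. It is not the code in Speller.java: the class name ReplacementCombinations, the three-argument signature, and the cap of 20 matches are assumptions made for illustration. It records the offset of every non-overlapping match of src, then uses the bits of a counter to decide, slot by slot, whether to emit rep or keep src. With n matches this yields 2^n - 1 strings (15 for the AbAcAdAe example), so some cap on n is needed in practice.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplacementCombinations {

    /**
     * Returns every string obtainable from str by replacing a non-empty
     * subset of the non-overlapping occurrences of src with rep.
     * Hypothetical helper for illustration; not the actual Speller method.
     */
    static List<String> getAllReplacements(String str, String src, String rep) {
        List<String> result = new ArrayList<>();
        if (src.isEmpty()) {
            return result; // nothing sensible to replace
        }
        // Collect the start offsets of all non-overlapping matches ("slots").
        List<Integer> slots = new ArrayList<>();
        for (int i = str.indexOf(src); i != -1; i = str.indexOf(src, i + src.length())) {
            slots.add(i);
        }
        int n = slots.size();
        if (n > 20) { // arbitrary cap (assumption): the output has 2^n - 1 entries
            throw new IllegalArgumentException("too many matches: " + n);
        }
        // Bit k of mask says whether slot k is replaced; mask 0 (no change) is skipped.
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            int pos = 0;
            for (int k = 0; k < n; k++) {
                int start = slots.get(k);
                sb.append(str, pos, start); // text between the previous slot and this one
                sb.append((mask & (1 << k)) != 0 ? rep : src);
                pos = start + src.length();
            }
            sb.append(str, pos, str.length()); // text after the last slot
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : getAllReplacements("AbAcAdAe", "A", "123")) {
            System.out.println(s);
        }
    }
}
```

Because the slots are computed on the original string, a replacement that itself contains src is never matched again, which a version that searches the already modified buffer would have to guard against.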

Best,
Marcin


>
>
> List<String> getAllReplacements(final String str, final String src,
>     final String rep, final int fromIndex) {
>   List<String> replaced = new ArrayList<String>();
>   StringBuilder sb = new StringBuilder();
>   sb.append(str);
>   int index = fromIndex;
>   while (index != -1) {
>     index = sb.indexOf(src, index);
>     if (index != -1) {
>       //TODO: we replace the strings one by one
>       // e.g., "abcdabxyzab", key = ab, rep = eg => "egcdabxyzab", "egcdegxyzeg"
>       // but we also need to have "abcdegxyzeg", "abcdegxyzab"...
>       sb.replace(index, index + src.length(), rep);
>       replaced.add(sb.toString());
>       replaced.addAll(getAllReplacements(str, src, rep, index + 1));
>     }
>   }
>   return replaced;
> }
>
>
>
>
> Regards,
> Jaume
>
>
> 2013/4/28 Marcin Miłkowski <[email protected] <mailto:[email protected]>>
>
>     Hi,
>
>     I have just implemented the approach similar to 2 & 3 (with some
>     simplifications, as I will point out later). The replacements are
>     specified in the .info file, for example:
>
>     fsa.dict.speller.equivalent-chars=x \u017a, l \u0142, u \u00f3, \u00f3 u
>     fsa.dict.speller.replacement-pairs=rz \u017c, \u017c rz, ch h, h ch
>
>     (note the unicode escape chars, required for Java property files). It
>     also deals with equivalent characters ("equivalent chars"), which is
>     similar to hunspell's MAP feature.
>
>     The new code is on GitHub:
>
>     https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java
>
>     Two known limitations:
>
>     (1) I don't check if the replaced word is already a correct one. I
>     should, and this is a small fix to be added around line 284 (add to
>     candidates if !isMisspelled).
>
>     (2) The getAllReplacements method is simplified right now (see TODO at
>     line 541). I cannot find a clean way to find all combinations of
>     possible replacements in a given string, if there are multiple instances
>     of the replaced substring. If you see a nice solution, please let me
>     know. I seem to be stuck on this trivial thing.
>
>     I also implemented new properties to ignore all uppercase words and to
>     ignore CamelCase words.
>
>     Now the only remaining large feature is to make run-ons better: I don't
>     know if banned suffixes or prefixes should go to the property file or to
>     the dictionary file directly.
>
>     Once we have this, the only remaining issue will be the conversion of
>     hunspell two-level affix dictionaries.
>
>     Regards,
>     Marcin
>
>     On 2013-04-25 11:19, Jaume Ortolà i Font wrote:
>      > As predicted, the code I wrote for multiple-character substitutions
>      > had several bugs. I solved them (see the attachment), but more
>      > problems could arise with other languages or other substitutions.
>      >
>      > Here I would like to talk about another approach for generating
>      > spelling suggestions: just checking the words with substitutions
>      > directly. Several steps could be done, but each step is taken only
>      > if no suggestions have been found in the previous one. These could
>      > be the steps:
>      >
>      > 1) Make a tree search.
>      > 2) Prepare words with substitutions. Are they misspelled words?
>      > 3) Make a new tree search of words with substitutions.
>      >
>      > Note that step 2) is very low cost, and step 3) is high cost. Step 2)
>      > could even be the first step.
>      >
>      > Would this approach be more or less efficient? It depends on the kind
>      > and the number of errors we find in the texts. When the only errors
>      > are multiple-character substitutions (one or more), it will be
>      > faster. When there is one multiple-character substitution error plus
>      > another kind of error, it will be slower. So the only way to decide
>      > is to try both and see which is better statistically.
>      >
>      > Note that using multiple-character substitution inside the tree
>      > search algorithm is not as costly as repeating the tree search, but
>      > it is something in between.
>      >
>      > Best regards,
>      > Jaume
>      >
>      >
>      >
>      >
>     
>      >
>      >
>      >
>      > _______________________________________________
>      > Languagetool-devel mailing list
>      > [email protected]
>     <mailto:[email protected]>
>      > https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>      >
>
>
>     
>
>
>
>
>


