Re: More on spelling suggestions

Paolo Bianchini Mon, 29 Apr 2013 08:28:25 -0700

On Apr 29, 2013, at 10:11 AM, Marcin Miłkowski wrote:

> On 2013-04-28 23:37, Jaume Ortolà i Font wrote:
>> Hi Marcin,
>> 
>> The condition
>> 
>> && !(containsSeparators &&
>>     candidate[depth]==(char)dictionaryMetadata.separator)
>> 
>> at line 345 should instead go at line 339, as we don't want to "add
>> candidates" containing a separator character.
> 
> We don't want to add such candidates, but a candidate is only added when 
> its last character is the terminal arc or when it is immediately 
> preceding the separator. So basically the candidate with a separator 
> cannot be added here at all because we stopped processing at the 
> separator on line 345.
> 
> This is why adding any further condition doesn't make sense to me.
> 
>> 
>> 
>> 
>> As for getAllReplacements, I think it could be recursive with an index
>> (starting fromIndex=0). Something like this (I haven't tried it):
> 
> Could you try it? As I understand your code, it would only lead to 
> replacing all instances, but we want more than that... Let me use a 
> trivial example:
> 
> AbAcAdAe
> 
> Say we want to replace "A" with "123". Then we want to have:
> 
> 123bAcAdAe
> Ab123cAdAe
> AbAc123dAe
> AbAcAd123e
> 123b123cAdAe
> 123bAc123d123e
> Ab123cAd123e
> 
> etc.
> 
> In other words, we have 4 possible slots for "123" (all matches of "A"), 
> and we should have cases where:
> 
> - only one slot is filled at any position
> - only two slots are filled at any position
> - only three slots are filled at any position
> - all four slots are filled
> 
> So, we need to have something like a Cartesian product... Something like 
> this: 
> http://stackoverflow.com/questions/14841652/string-replacement-combinations
>


I did not look at the link you sent, but to stick with your example: you have 
four occurrences of the pattern that you want to replace. This means there are 
2^4 possible resulting strings (including the unchanged one), and you can 
generate them by replacing occurrences as if you were counting in binary:

0001
0010
0011
0100
0101
0110
0111
1000
1001
…
1111

Therefore, a possible solution could be:

1) find the number of occurrences of the pattern to replace and store it in N

2) iterate i from 1 to 2^N - 1

3) at each iteration, replace the occurrence in position j (with 0 <= j < N) 
if bit j of the binary representation of i is on, and leave it alone if it is 
off (a bitwise AND does the test: i & 2^0, i & 2^1, i & 2^2, …, 
i & 2^(N-1)), then add the resulting string to the result list
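
The three steps can be sketched in Java roughly as follows. This is only a
minimal sketch, not tested against Speller.java: the class name
ReplacementCombinations and the three-argument signature are made up for
illustration, occurrences are assumed not to overlap, and the cap of 20
occurrences is an arbitrary guard against the 2^N blow-up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of morfologik-speller.
class ReplacementCombinations {

    /**
     * Returns every string obtained by replacing a non-empty subset of the
     * non-overlapping occurrences of src in str with rep.
     */
    static List<String> getAllReplacements(String str, String src, String rep) {
        List<String> result = new ArrayList<>();
        if (src.isEmpty()) {
            return result;
        }
        // Step 1: collect the start offsets of all occurrences (N = positions.size()).
        List<Integer> positions = new ArrayList<>();
        int index = str.indexOf(src);
        while (index != -1) {
            positions.add(index);
            index = str.indexOf(src, index + src.length());
        }
        int n = positions.size();
        if (n > 20) { // assumed limit, 2^N strings would be too many
            throw new IllegalArgumentException("too many occurrences: " + n);
        }
        // Step 2: count from 1 to 2^N - 1 (0 would be the unchanged string).
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            int copiedUpTo = 0;
            for (int j = 0; j < n; j++) {
                int pos = positions.get(j);
                sb.append(str, copiedUpTo, pos);
                // Step 3: bit j of mask decides whether occurrence j is replaced.
                sb.append((mask & (1 << j)) != 0 ? rep : src);
                copiedUpTo = pos + src.length();
            }
            sb.append(str, copiedUpTo, str.length());
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : getAllReplacements("AbAcAdAe", "A", "123")) {
            System.out.println(s);
        }
    }
}
```

For "AbAcAdAe" this yields 15 strings, from "123bAcAdAe" (only bit 0 set) up
to "123b123c123d123e" (all four bits set). Building each string from the
original offsets avoids searching inside text that has already been replaced.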

Ciao

Paolo





> Best,
> Marcin
> 
> 
>> 
>> 
>> List<String> getAllReplacements(final String str, final String src,
>>     final String rep, final int fromIndex) {
>>   List<String> replaced = new ArrayList<String>();
>>   StringBuilder sb = new StringBuilder();
>>   sb.append(str);
>>   int index = fromIndex;
>>   while (index != -1) {
>>     index = sb.indexOf(src, index);
>>     if (index != -1) {
>>       //TODO: we replace the strings one by one
>>       // e.g., "abcdabxyzab", key = ab, rep = eg => "egcdabxyzab", "egcdegxyzeg"
>>       // but we also need to have "abcdegxyzeg", "abcdegxyzab"...
>>       sb.replace(index, index + src.length(), rep);
>>       replaced.add(sb.toString());
>>       replaced.addAll(getAllReplacements(str,src,rep,index+1));
>>     }
>>   }
>>   return replaced;
>> }
>> 
>> 
>> 
>> 
>> Regards,
>> Jaume
>> 
>> 
>> 2013/4/28 Marcin Miłkowski <[email protected]>
>> 
>>    Hi,
>> 
>>    I have just implemented an approach similar to steps 2 and 3 (with some
>>    simplifications, as I will point out later). The replacements are
>>    specified in the .info file, for example:
>> 
>>    fsa.dict.speller.equivalent-chars=x \u017a, l \u0142, u \u00f3, \u00f3 u
>>    fsa.dict.speller.replacement-pairs=rz \u017c, \u017c rz, ch h, h ch
>> 
>>    (note the unicode escape chars, required for Java property files). It
>>    also deals with equivalent characters ("equivalent chars"), which is
>>    similar to hunspell's MAP feature.
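
To illustrate the note about escapes: java.util.Properties decodes \uXXXX
sequences when loading, so the values arrive as real characters. The sketch
below is only an assumption about how such a value could be split into pairs
(comma-separated entries, each "source target" separated by whitespace); the
class ReplacementPairsExample and its methods are hypothetical and are not
taken from Speller.java.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper class, not part of morfologik-speller.
class ReplacementPairsExample {

    /** Parses "a b, c d" into {a=[b], c=[d]}; the format is assumed from the example. */
    static Map<String, List<String>> parsePairs(String value) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        for (String pair : value.split(",")) {
            String[] parts = pair.trim().split("\\s+");
            if (parts.length == 2) { // silently skip malformed entries
                pairs.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
            }
        }
        return pairs;
    }

    /** Loads the property text and parses the value stored under key. */
    static Map<String, List<String>> load(String infoText, String key) {
        Properties props = new Properties();
        try {
            // Properties.load turns \u017c into the character it denotes.
            props.load(new StringReader(infoText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parsePairs(props.getProperty(key, ""));
    }

    public static void main(String[] args) {
        String key = "fsa.dict.speller.replacement-pairs";
        String info = key + "=rz \\u017c, \\u017c rz, ch h, h ch\n";
        System.out.println(load(info, key));
    }
}
```

A list per key is used so that one source string can map to several
replacements without earlier entries being overwritten.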
>> 
>>    The new code is at github:
>> 
>>    
>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java
>> 
>>    Two known limitations:
>> 
>>    (1) I don't check if the replaced word is already a correct one. I
>>    should, and this is a small fix to be added around line 284 (add to
>>    candidates if !isMisspelled).
>> 
>>    (2) The getAllReplacements method is simplified right now (see TODO at
>>    line 541). I cannot find a clean way to find all combinations of
>>    possible replacements in a given string, if there are multiple instances
>>    of the replaced substring. If you see a nice solution, please let me
>>    know. I seem to be stuck on this trivial thing.
>> 
>>    I also implemented new properties to ignore all uppercase words and to
>>    ignore CamelCase words.
>> 
>>    Now the only remaining large feature is to make runons better: I don't
>>    know if banned suffixes or prefixes should go to the property file or to
>>    the dictionary file directly.
>> 
>>    After we have this, only the conversion of hunspell two-level affix
>>    dictionaries will be the issue.
>> 
>>    Regards,
>>    Marcin
>> 
>>    On 2013-04-25 11:19, Jaume Ortolà i Font wrote:
>>> As predicted, the code I wrote for multiple-character substitutions had
>>> several bugs. I solved them (see the attachment), but more problems
>>> could arise with other languages or other substitutions.
>>> 
>>> Here I would like to talk about another approach for generating spelling
>>> suggestions: just checking the words with substitutions directly.
>>> Several steps could be done, but each step is taken only if no
>>> suggestions have been found in the previous one. These could be the steps:
>>> 
>>> 1) Make a tree search.
>>> 2) Prepare words with substitutions. Are they misspelled words?
>>> 3) Make a new tree search of words with substitutions.
>>> 
>>> Note that step 2) is very low cost, and step 3) is high cost. Step 2)
>>> could even be the first step.
>>> 
>>> Would this approach be more or less efficient? It depends on the kind
>>> and the number of errors we find in the texts. When the only errors are
>>> multiple-character substitutions (one or more), it will be faster.
>>> When there is one multiple-character substitution error plus another
>>> kind of error, it will be slower. So the only way to decide which
>>> is better is to try both and compare them statistically.
>>> 
>>> Note that using multiple-character substitution inside the tree search
>>> algorithm is not as costly as repeating the tree search, but it is
>>> something in between.
>>> 
>>> Best regards,
>>> Jaume
>>> 
>>> 
>>> 
>>> 
>>    
>> ------------------------------------------------------------------------------
>>> Try New Relic Now & We'll Send You this Cool Shirt
>>> New Relic is the only SaaS-based application performance
>>    monitoring service
>>> that delivers powerful full stack analytics. Optimize and monitor
>>    your
>>> browser, app, & servers with just a few lines of code. Try New Relic
>>> and get this awesome Nerd Life shirt!
>>    http://p.sf.net/sfu/newrelic_d2d_apr
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> [email protected]
>>    <mailto:[email protected]>
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>> 
>> 
>> 
>>    
>> 
>> 
>> 
>> 
>> 
> 
> 


