On 2013-04-28 23:37, Jaume Ortolà i Font wrote:
> Hi Marcin,
>
> The condition
>
>     && !(containsSeparators &&
>         candidate[depth]==(char)dictionaryMetadata.separator)
>
> at line 345, should go, instead, at line 339, as we don't want to "add
> candidates" containing a separator character.

We don't want to add such candidates, but a candidate is only added when
its last character is on a terminal arc or when it immediately precedes
the separator. So a candidate containing a separator cannot be added here
at all, because we stop processing at the separator on line 345. This is
why adding any further condition doesn't make sense to me.

> As for getAllReplacements, I think it could be recursive with an index
> (starting fromIndex=0). Something like this (I haven't tried it):

Could you try it? As far as I understand your code, it would only lead to
replacing all instances, but we want more than that... Let me use a
trivial example:

AbAcAdAe

Say we want to replace "A" with "123". Then we want to have:

123bAcAdAe
Ab123cAdAe
AbAc123dAe
AbAcAd123e
123b123cAdAe
123bAc123d123e
Ab123cAd123e
etc.

In other words, we have 4 possible slots for "123" (all matches of "A"),
and we should have cases where:

- only one slot is filled, at any position
- only two slots are filled, at any positions
- only three slots are filled, at any positions
- all four slots are filled

So we need something like a Cartesian product... Something like this:

http://stackoverflow.com/questions/14841652/string-replacement-combinations

Best,
Marcin

> List<String> getAllReplacements(final String str, final String src,
>         final String rep, final int fromIndex) {
>     List<String> replaced = new ArrayList<String>();
>     StringBuilder sb = new StringBuilder();
>     sb.append(str);
>     int index = fromIndex;
>     while (index != -1) {
>         index = sb.indexOf(src, index);
>         if (index != -1) {
>             //TODO: we replace the strings one by one
>             // e.g., "abcdabxyzab", key = ab, rep = eg => "egcdabxyzab", "egcdegxyzeg"
>             // but we also need to have "abcdegxyzeg", "abcdegxyzab"...
>             sb.replace(index, index + src.length(), rep);
>             replaced.add(sb.toString());
>             replaced.addAll(getAllReplacements(str, src, rep, index + 1));
>         }
>     }
>     return replaced;
> }
>
> Regards,
> Jaume
>
> 2013/4/28 Marcin Miłkowski <[email protected] <mailto:[email protected]>>
>
>     Hi,
>
>     I have just implemented the approach similar to 2 & 3 (with some
>     simplifications, as I will point out later). The replacements are
>     specified in the .info file, for example:
>
>     fsa.dict.speller.equivalent-chars=x \u017a, l \u0142, u \u00f3, \u00f3 u
>     fsa.dict.speller.replacement-pairs=rz \u017c, \u017c rz, ch h, h ch
>
>     (note the unicode escape chars, required for Java property files). It
>     also deals with equivalent characters ("equivalent chars"), which is
>     similar to hunspell's MAP feature.
>
>     The new code is at github:
>
>     https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java
>
>     Two known limitations:
>
>     (1) I don't check if the replaced word is already a correct one. I
>     should, and this is a small fix to be added around line 284 (add to
>     candidates if !isMisspelled).
>
>     (2) The getAllReplacements method is simplified right now (see TODO at
>     line 541). I cannot find a clean way to find all combinations of
>     possible replacements in a given string, if there are multiple
>     instances of the replaced substring. If you see a nice solution,
>     please let me know. I seem to be stuck on this trivial thing.
>
>     I also implemented new properties to ignore all uppercase words and to
>     ignore CamelCase words.
>
>     Now the only remaining large feature is to make runons better: I don't
>     know if banned suffixes or prefixes should go to the property file or
>     to the dictionary file directly.
>
>     After we have this, only the conversion of hunspell two-level affix
>     dictionaries will be the issue.
>
>     Regards,
>     Marcin
>
>     On 2013-04-25 11:19, Jaume Ortolà i Font wrote:
>     > As predicted, the code I wrote for multiple character substitutions
>     > had several bugs. I solved them (see the attachment), but more
>     > problems could arise with other languages or other substitutions.
>     >
>     > Here I would like to talk about another approach for generating
>     > spelling suggestions: just checking the words with substitutions
>     > directly. Several steps could be done, but each step is taken only
>     > if no suggestions have been found in the previous one. These could
>     > be the steps:
>     >
>     > 1) Make a tree search.
>     > 2) Prepare words with substitutions. Are they misspelled words?
>     > 3) Make a new tree search of words with substitutions.
>     >
>     > Note that step 2) is very low cost, and step 3) is high cost.
>     > Step 2) could even be the first step.
>     >
>     > Would this approach be more or less efficient? It depends on the
>     > kind and the number of errors we find in the texts. When there is
>     > only one or more errors of multiple character substitution, then it
>     > will be faster. When there is one error of multiple character
>     > substitution plus another kind of error, then it will be slower. So
>     > the only way to decide which is better is to try both and see which
>     > is better statistically.
>     >
>     > Note that using multiple character substitution inside the tree
>     > search algorithm is not so costly as repeating the tree search, but
>     > it is something in between.
>     >
>     > Best regards,
>     > Jaume

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
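
P.S. The "slots" idea from the AbAcAdAe example in this message can be
sketched as follows. This is only an illustration, not the code in
Speller.java: the class name, the three-argument signature and the cap of
20 slots are all made up here, and it assumes that matches of src are
taken left to right without overlapping. Each bit of a counter decides
whether one slot is filled, so every non-empty subset of the slots is
produced exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of morfologik.
final class ReplacementCombinations {

    /**
     * Returns every string obtained by replacing a non-empty subset of the
     * non-overlapping occurrences of src in str with rep.
     */
    static List<String> getAllReplacements(String str, String src, String rep) {
        List<String> result = new ArrayList<String>();
        if (src.isEmpty()) {
            return result; // an empty pattern has no meaningful slots
        }

        // Collect the start offsets of all slots, scanning left to right.
        List<Integer> slots = new ArrayList<Integer>();
        int index = str.indexOf(src);
        while (index != -1) {
            slots.add(index);
            index = str.indexOf(src, index + src.length());
        }

        int n = slots.size();
        if (n > 20) {
            // Arbitrary cap (an assumption): 2^n - 1 results grow very quickly.
            throw new IllegalArgumentException("too many matches: " + n);
        }

        // Bit i of mask set means slot i is filled with rep.
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            int pos = 0;
            for (int i = 0; i < n; i++) {
                int start = slots.get(i);
                sb.append(str, pos, start); // unchanged text before the slot
                sb.append((mask & (1 << i)) != 0 ? rep : src);
                pos = start + src.length();
            }
            sb.append(str, pos, str.length()); // unchanged tail
            result.add(sb.toString());
        }
        return result;
    }
}
```

For getAllReplacements("AbAcAdAe", "A", "123") this yields 15 strings,
starting with "123bAcAdAe" and ending with "123b123c123d123e". The number
of results doubles with every additional match, so a real speller would
probably want to limit the number of slots or stop early.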
