The attachment. Jaume
2013/4/20 Jaume Ortolà i Font <[email protected]> > Hi, > > As Marcin said, single character-multiple character substitutions are > tricky in the algorithm. So far I have come up with no solution. > > For the rest of cases I explained before, I think there is an easy and > general solution. When comparing characters at a given depth between the > original word and the candidate in ed(), this comparison can be made > case-insensitive and "diacritics-insensitive". This way we'll get more > proper suggestions. And furthermore if the substitutions are only case > changes or addition/removal of diacritics, then the total distance will be > zero, and these suggestions will appear in the first positions, as desired. > Is this valid for all languages? > > I attach Speller.java with my implementation (search for "by Jaume > Ortola"). Probably it can be implemented in some more efficient way. > > When using case-insensitive character comparison, there is a bug. But I > think it already existed previously. Some strings ending in "_A" are > accepted as candidates. For example, "Pec_A" is a suggestion for "Pecra". > These suggestions can be easily discarded, but the bug probably indicates > that something is not done properly in the algorithm. The problem > disappears when you use case-sensitive character comparison. > > Best, > Jaume Ortolà > > > Salutacions, > Jaume Ortolà > www.riuraueditors.cat > > > > 2013/4/18 Marcin Miłkowski <[email protected]> > >> W dniu 2013-04-18 16:28, Daniel Naber pisze: >> > On 18.04.2013, 14:41:21 Jaume Ortolà i Font wrote: >> > >> > Hi Jaume, >> > >> >> For achieving this, I think that some changes in a word should be >> >> considered as representing a lesser "distance" from the original word >> >> than others. In Catalan, for example, these changes could be: >> > >> > the right approach is to add this into the algorithm that traverses the >> > dictionary tree. For German, I needed a solution fast and ended up with >> a >> > hack in GermanSpellerRule. It's easy to understand, but if you could >> check >> > the morfologik algorithm we use and improve that, it would be great. >> Marcin >> > can probably give some directions I think. >> >> Yes, Daniel is right. The dictionary tree is traversed quite >> straightforwardly; what you want is to prefer some paths not only based >> on simple distance, as we do now, but simply based on similarities. For >> that, fsa_spell uses a simple replacement table, which is easy for >> character-for-character substitution but does not work for >> character-multiple characters substitution. We simply need something >> more general, and we can generate candidate words during tree traversal. >> The code you need to change belongs to morfologik-speller, in Speller >> class, in particular findRepl. The idea of findRepl is quite easy: for a >> given depth (=position in the string) we try to find an entry in >> dictionary (arc) that is not too far in terms of the edit distance. If >> we still have edit distance < limit, and we find that the arc is >> terminal (no other characters in the word), then we add a candidate; >> otherwise, we move one character and try to find another candidate. This >> is all recursive, so we end up with multiple candidates. >> >> findRepl simply uses a loop to go through all arcs in the dictionary and >> stops traversing if the edit distance is too far. There are basically >> two ways of changing this behavior: >> >> (a) change the way edit distance is calculated in cuted(), by allowing >> for more flexible replacements, but this is quite tricky; >> >> (b) do not call cuted for single character-multiple character >> substitutions in a given substitution table; >> >> (c) prepare the list of all possible substitutions in the original word >> and use the original findRepl (similar to your idea, basically); but I'm >> not quite sure if this way, we would really find all candidates we want >> to find. >> >> I think we need to focus on (a) or (b); maybe an additional parameter to >> cuted might tell the function to treat multiple character replacement as >> a single character replacement... >> >> Now, cuted() uses a very smart (not mine!) way of precomputing of edit >> distances in a matrix, based on Jan Daciuk's improvements to Oflazer's >> algorithm. But there is a bug in the original algorithm (the matrix H >> that represents a matrix has wrong dimensions) and Jan does not have >> time to remove it. I don't know how to fix it - I made a dirty trick >> (look for "FIXME" in the code). This is probably quite trivial but I >> don't have time to reread Jan's code and Oflazer's paper. >> >> Best, >> Marcin >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Languagetool-devel mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/languagetool-devel >> > >
Speller.java
Description: Binary data
------------------------------------------------------------------------------ Precog is a next-generation analytics platform capable of advanced analytics on semi-structured data. The platform includes APIs for building apps and a phenomenal toolset for data science. Developers can use our toolset for easy data analysis & visualization. Get a free account! http://www2.precog.com/precogplatform/slashdotnewsletter
_______________________________________________ Languagetool-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/languagetool-devel
