On 2013-04-29 17:56, Jaume Ortolà i Font wrote:
> Hi,
>
> I have made the algorithm more general. Now we really get all the
> possible combinations of replacements, including multiple and different
> substitutions inside a word.
>
> For example, for "tel·lenovela" we get four suggestions: "telenovel·la,
> tel·lenovela, telenovela, tel·lenovel·la". Note that there are
> replacements L->L·L and L·L->L at the same time, which are necessary to
> find the right suggestion, the first one.
>
> The idea is that once we find a possible replacement, we start two new
> searches recursively: one with the replacement and another without it.
> Both new searches are done from an increased index, not from the start
> of the word.
>
> The "candidates" are added to the list only when we reach a terminal
> branch where no replacement is possible.
>
> The key and replacement pairs are now iterated inside
> getAllReplacements(), so we can get different kinds of replacements
> inside the word.
>
> This works fine with all my examples in Catalan.
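The recursive branching described above can be sketched roughly like this. This is a minimal illustration, not the actual LanguageTool code: the class and method names (apart from getAllReplacements) are made up for the example, and the handling of overlapping keys such as L vs. L·L is simplified, so it may generate more candidates than the real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReplacementCombinations {

    // Hypothetical replacement pairs, e.g. "f" -> "ph"
    private final Map<String, String> replacements;

    public ReplacementCombinations(Map<String, String> replacements) {
        this.replacements = replacements;
    }

    public List<String> getAllReplacements(String word) {
        List<String> results = new ArrayList<>();
        collect(word, 0, results);
        return results;
    }

    private void collect(String word, int fromIndex, List<String> results) {
        for (int i = fromIndex; i < word.length(); i++) {
            boolean found = false;
            for (Map.Entry<String, String> e : replacements.entrySet()) {
                String key = e.getKey();
                if (word.startsWith(key, i)) {
                    found = true;
                    // branch 1: apply the replacement and continue
                    // from an increased index, not from the start
                    String replaced = word.substring(0, i) + e.getValue()
                            + word.substring(i + key.length());
                    collect(replaced, i + e.getValue().length(), results);
                }
            }
            if (found) {
                // branch 2: keep the word unchanged here and move on
                collect(word, i + 1, results);
                return;
            }
        }
        // terminal branch: no further replacement is possible
        if (!results.contains(word)) {
            results.add(word);
        }
    }
}
```

With one pair "f" -> "ph" and the word "fotograf", this yields the four combinations "photograph", "photograf", "fotograph", and "fotograf", i.e. 2^n candidates for n replacement sites, matching the 15-plus-one count from the four-replacement test mentioned below.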
I did some further tests (and wrote a JUnit test with 4 replacements; it works fine, I get 15 combinations plus the unchanged word), and it works perfectly.

I changed your code slightly: isMisspelled may return true because of the flags set in the properties, and hence some replacements might be accepted just because they are all-uppercase or something similar. We don't want such side effects. I introduced two new properties:

fsa.dict.speller.ignore-camel-case
fsa.dict.speller.ignore-all-uppercase

They are self-explanatory. I think this code is quite fine right now.

What we need next is the prefix/suffix rejection for run-on words (I'm still not sure whether this should be an entry in the .info file) and the conversion of the hunspell file.

I read the code at lucene-hunspell (https://code.google.com/p/lucene-hunspell/source/browse/#svn%2Ftrunk%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Fanalysis%2Fhunspell), but it uses a different strategy than I would: it just tries to stem words, so it isn't useful for unmunching the dictionary. I'm not sure whether this code is the one really used with Lucene (could anyone confirm?), as it's quite old now and the version number is only 0.2. Anyway, it parses the affix file, which is good, and we would need to read the whole hunspell file and produce all affixed forms of every word.

I don't really get the CharArrayMap used in the code to represent lookup tables, but I think the modification we need is in HunspellDictionary.java: a new method that simply iterates through all words in the words CharArrayMap and applies all prefixes and suffixes that match. Unfortunately, we would also need to add support for COMPOUNDMIN, COMPOUNDFLAG, etc. This would be very similar to the affix support that is already there.
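The "unmunch" step proposed above, iterating over all stems and applying every matching affix rule, could look roughly like this. This is a standalone sketch, not the lucene-hunspell API: the Suffix record, the flag encoding, and the expand method are all invented for the example, and real hunspell rules also carry match conditions, prefixes, and compound flags that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Unmunch {

    // A drastically simplified suffix rule: a one-character flag,
    // the ending to strip from the stem, and the ending to append
    public record Suffix(char flag, String strip, String append) {}

    // Expand every stem with every suffix rule whose flag the stem carries
    public static List<String> expand(Map<String, String> stemsToFlags,
                                      List<Suffix> suffixes) {
        List<String> forms = new ArrayList<>();
        for (Map.Entry<String, String> e : stemsToFlags.entrySet()) {
            String stem = e.getKey();
            String flags = e.getValue();
            forms.add(stem);  // the bare stem is itself a valid form
            for (Suffix s : suffixes) {
                if (flags.indexOf(s.flag()) >= 0 && stem.endsWith(s.strip())) {
                    forms.add(stem.substring(0, stem.length() - s.strip().length())
                            + s.append());
                }
            }
        }
        return forms;
    }
}
```

For instance, the stem "walk" with flags "SD" and the rules (S, strip "", append "s") and (D, strip "", append "ed") expands to "walk", "walks", "walked"; a strip rule like (Y, strip "y", append "ily") turns "happy" into "happily". The real method would iterate the words CharArrayMap instead of a plain Map and would also have to honor compound settings such as COMPOUNDMIN.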
Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
