Marcin Miłkowski <list-addr...@wp.pl> wrote:

> On 2012-06-04 22:51, Dominique Pellé wrote:
> > Hi
> >
> > Another problem with spell checking is the quality
> > of the Esperanto Hunspell dictionary. It's not good
> > enough. Too many correct words are highlighted
> > because they are missing from the dictionary. That's
> > not LT's fault here.
>
> OK, I disabled HunspellRule for Esperanto.

Good.

> > Given all the unresolved issues at least in all the languages
> > that I maintain (br, fr, eo), can we consider turning
> > Hunspell off by default? I'm concerned that people
> > downloading the nightly build will experience many
> > spurious errors.
>
> Well, I fixed errors for Breton, and there's a fix for French, so maybe
> it's not a big problem?

Thanks. The Breton spelling checker looks much better now.
French is better. However, for French, it only accepts the ASCII
apostrophe (U+0027) and not the typographic apostrophe (U+2019).
Yet U+2019 is the recommended apostrophe, at least in French.
In other words, a word like "jusqu'à" is OK but a word like
"jusqu’à" is marked as a typo (yet it should be the preferred
spelling). I see that the Breton Hunspell accepts both apostrophes,
so we should be able to fix French the same way as was done for
Breton. I have not had the time to look at it yet.

> > As an experiment, I also commented out the Hunspell
> > rule in src/java/org/languagetool/language/Breton.java
> > and LT is then more than twice as fast (even when comparing
> > using -d HUNSPELL_RULE).
>
> It's because loading a language activates the HunspellRule constructor,
> and the constructor reads files from disk.

I see that Daniel made it initialize lazily, and using
-d HUNSPELL_RULE now has no measurable overhead compared
to the old version prior to the Hunspell checkins. See my new
measurements in a previous email. Good.

> I'm not a fan of hunspell; I think it has a wrong approach for creating
> suggestions because the computational complexity of its algorithm is
> simply too high. It should use something else, such as composition of a
> Levenshtein distance automaton with a dictionary automaton, and that
> would be really fast (such an approach is used by suggest methods in
> Lucene). Its "user-friendly" representation of affixation could be even
> nicer with twolc/lexc files for creating automata.
> Anyway, voikko (the
> Finnish speller) might have scripts to convert hunspell files to such
> automata, and we might, in the future, use a better algorithm. If my
> algorithm for morfologik turns out to be implemented correctly, we may
> also use it for some languages -- Polish is *pathetically* slow in
> hunspell. Right now, however, for pragmatic reasons, I would vote for
> hunspell in LanguageTool 1.8.

OK. I'll do more tests in the coming days. I hope I did not
sound too negative with all the reported issues. The speed is
also much better than in the initial checkin. Somehow changing
the Hunspell lib sped it up. I'm also glad that almost all
issues with Hunspell are fixed already, except at least for the
U+2019 apostrophe in French mentioned above.

Thanks!
-- Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
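
To illustrate the U+2019 issue in French mentioned above, here is a
minimal sketch of one possible workaround: map the typographic
apostrophe to the ASCII one before a word is looked up in the
dictionary. This is only an assumption about how it could be handled.
The class ApostropheNormalizer and its normalize method are hypothetical
names, not part of LanguageTool or Hunspell, and this is not necessarily
how the Breton dictionary accepts both forms.

```java
// Hypothetical helper, not part of LanguageTool: maps the typographic
// apostrophe (U+2019) to the ASCII apostrophe (U+0027) so that a
// dictionary knowing only the ASCII form can still match the word.
public final class ApostropheNormalizer {

    private static final char TYPOGRAPHIC_APOSTROPHE = '\u2019';
    private static final char ASCII_APOSTROPHE = '\'';

    private ApostropheNormalizer() {
        // Utility class, no instances.
    }

    // Returns a copy of the word with every U+2019 replaced by U+0027.
    // Words without U+2019 are returned unchanged.
    public static String normalize(String word) {
        return word.replace(TYPOGRAPHIC_APOSTROPHE, ASCII_APOSTROPHE);
    }

    public static void main(String[] args) {
        String typographic = "jusqu\u2019\u00e0"; // "jusqu’à"
        String ascii = "jusqu'\u00e0";            // "jusqu'à"
        // Both spellings end up as the same lookup key.
        System.out.println(normalize(typographic).equals(ascii)); // prints "true"
        System.out.println(normalize(ascii).equals(ascii));       // prints "true"
    }
}
```

The normalized string would only be used as the lookup key. The text
shown to the user keeps its original apostrophe, so the preferred
U+2019 spelling is left untouched.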