On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
> 2013/11/26 Daniel Naber <list2...@danielnaber.de>
>
> > On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
> >
> > > Look at these wordlists [1]. They are Apache 2.0. The words are
> > > classified in 256 ranges.
> > >
> > > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
> >
> > The German one looks okay. Unless the quality is a serious problem for
> > some language, I'd suggest to simply use these lists.
>
> I think the quality won't be a serious problem (even with the
> tokenization differences in Catalan). The goal is just to avoid common
> words (usually short ones) being hidden by dozens of other uncommon
> words in spelling suggestions. So these wordlists seem good enough.
>
> Now, we need Marcin to say something about how to add this data to the
> FSA dictionaries. I guess we just need to add an extra field (with a
> separator) after the POS tag.

Exactly. Also, there is already some code in the speller that sorts the
suggestions (in the CandidateData internal class in Speller). It could use
the frequency as a sort key as well, not only the edit distance.

The dictionary can already contain separators, so we can freely use a
second field. Maybe we should have another flag to say that this field is
actually used for frequency.

Note: in 1.8, we changed the dictionary building process because there was
a tiny bug. I am unable to work on the existing dictionaries (some are
faulty, for example, the Slovak one) until next week, but then we'll
upgrade to morfologik-stemming 1.8 to remove the bugs and get somewhat
better dictionaries. Interestingly, infix encoding currently turns out to
be suboptimal in terms of automaton size, while prefix encoding is very
good.

Sorry for being terse, I have an important event (habilitation exam)
tomorrow, so I cannot write more at the moment.

Best,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
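
For illustration only, the idea discussed in this thread (a frequency field stored after the POS tag, and used together with the edit distance when ordering suggestions) could be sketched roughly as below. Everything here is an assumption made for the example: the class and method names are hypothetical and are not the morfologik-stemming or LanguageTool API, the `+` separator and the `POS+frequency` layout are guesses, and the frequency is taken to be a class from 0 to 255 where a higher value means a more common word.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: these names do not come from morfologik-stemming
// or LanguageTool; they only illustrate the idea from the thread.
final class FrequencySortSketch {

    // Assumed separator between the POS tag and the frequency field.
    static final char SEPARATOR = '+';

    /** A suggestion with its edit distance and an assumed frequency class (0..255). */
    static final class Candidate {
        final String word;
        final int distance;
        final int frequency;

        Candidate(String word, int distance, int frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    /**
     * Reads the frequency from an assumed "POS+frequency" field.
     * Returns 0 when the trailing field is missing or is not a number.
     */
    static int parseFrequency(String tagField) {
        int idx = tagField.lastIndexOf(SEPARATOR);
        if (idx < 0 || idx == tagField.length() - 1) {
            return 0;
        }
        try {
            return Integer.parseInt(tagField.substring(idx + 1));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /** Orders by edit distance (ascending), then by frequency (descending). */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(
            Comparator.comparingInt((Candidate c) -> c.distance)
                .thenComparing(
                    Comparator.comparingInt((Candidate c) -> c.frequency).reversed()));
        List<String> words = new ArrayList<>();
        for (Candidate c : sorted) {
            words.add(c.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("thee", 1, parseFrequency("NN+40")));
        candidates.add(new Candidate("the", 1, parseFrequency("DT+250")));
        candidates.add(new Candidate("then", 2, parseFrequency("RB+200")));
        System.out.println(rank(candidates)); // prints [the, thee, then]
    }
}
```

Keeping the edit distance as the primary key and the frequency only as a tie-breaker is one possible design. It matches the stated goal of keeping common words from being buried among uncommon ones at the same distance, without letting a very frequent but distant word outrank a close match.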