Hi Jaume and all,

I'm a newbie here, but I have a question: why isn't there a frequency list for Galician in https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries ?

2013/12/9 Jaume Ortolà i Font <jaumeort...@gmail.com>:
> Hi,
>
> I have implemented the use of the frequency word lists, and it works well
> for me. I will make pull requests to both the LanguageTool and Morfologik
> projects. In LanguageTool, I added to POSDictionaryBuilder the option of
> reading frequency data from files like ca_wordlist.xml.
>
> Now there are only minor problems in some tests in Polish and German, which
> use '+' as a separator in the FSA dictionary. The same character is also
> used in some POS tags. If a flag for frequency data is added to the
> dictionary properties (which is advisable anyway), the problem will be
> solved. In fact, with this flag, we could treat the last byte as the
> frequency data, and the separator between POS tag and frequency would not
> be needed.
>
> The other solution is to change the separator...
>
> Regards,
> Jaume Ortolà
>
>
> 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
>>
>> On 2013-11-26 at 18:44, Jaume Ortolà i Font wrote:
>> > 2013/11/26 Daniel Naber <list2...@danielnaber.de
>> > <mailto:list2...@danielnaber.de>>
>> >
>> >     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>> >
>> >     > Look at these wordlists [1]. They are Apache 2.0. The words are
>> >     > classified in 256 ranges.
>> >
>> >     > [1]
>> >     > https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>> >
>> >     The German one looks okay. Unless the quality is a serious problem
>> >     for some language, I'd suggest simply using these lists.
>> >
>> >
>> > I think the quality won't be a serious problem (even with the
>> > tokenization differences in Catalan). The goal is just to avoid common
>> > words (usually short ones) being hidden by dozens of other uncommon
>> > words in spelling suggestions. So these wordlists seem good enough.
>> >
>> > Now, we need Marcin to say something about how to add this data to the
>> > FSA dictionaries. I guess we just need to add an extra field (with a
>> > separator) after the POS tag.
>>
>> Exactly. Also, there is already some code in the speller that sorts the
>> suggestions and could use the frequency to sort them as well, not only
>> the edit distance (in the CandidateData internal class in Speller).
>>
>> The dictionary can already contain separators, so we can freely use a
>> second field. Maybe we should have another flag to say that this field
>> is actually used for frequency.
>>
>> Note: in 1.8, we changed the dictionary building process because there
>> was a tiny bug. I am unable to work on existing dictionaries (some are
>> faulty, for example, the Slovak one) until the next week, but then we'll
>> upgrade to morfologik-stemming 1.8 to remove the bugs and have somewhat
>> better dictionaries. Interestingly, infix encoding turns out to be
>> suboptimal right now in terms of the automaton size but prefix is very
>> good.
>>
>> Sorry for being terse, I have an important event (habilitation exam)
>> tomorrow, so I cannot write more at the moment.
>>
>> Best,
>> Marcin
>>
>> _______________________________________________
>> Languagetool-devel mailing list
>> Languagetool-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel

-- 
Antón Méixome - Galician Native Lang Coordination
Galician community LibO & AOO
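
P.S. To illustrate the idea discussed in the quoted thread (sorting spelling suggestions by frequency as well as by edit distance), here is a minimal sketch. It is not the Morfologik Speller code: the `Candidate` type, the `rank` function and the sample words and values are hypothetical, and the only thing taken from the thread is that frequencies fall into 256 ranges (assumed here to be 0 for rare up to 255 for very common).

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical suggestion record; not the Speller's CandidateData class."""
    word: str
    distance: int   # edit distance from the misspelled word
    frequency: int  # assumed frequency class: 0 (rare) .. 255 (very common)


def rank(candidates: List[Candidate]) -> List[str]:
    """Order by smallest edit distance first, then by highest frequency.

    The word itself is the final tie-breaker so the order is deterministic.
    """
    ordered = sorted(candidates, key=lambda c: (c.distance, -c.frequency, c.word))
    return [c.word for c in ordered]


# Made-up sample data: three candidates at distance 1, one at distance 2.
suggestions = rank([
    Candidate("thew", 1, 3),
    Candidate("the", 1, 250),
    Candidate("then", 1, 200),
    Candidate("there", 2, 220),
])
print(suggestions)
```

With this ordering, a common word such as "the" is no longer hidden behind rarer words at the same edit distance, while a more distant candidate still comes last even if it is frequent.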