On 2013-12-09 00:12, Jaume Ortolà i Font wrote:
> Hi,
>
> I have implemented the use of the frequency word lists, and it works
> well for me. I will make pull requests to both the LanguageTool and
> Morfologik projects. In LanguageTool, I added to POSDictionaryBuilder
> the option of reading frequency data from files like ca_wordlist.xml.

Excellent!

> Now there are only minor problems in some tests in Polish and German,
> which use '+' as a separator in the FSA dictionary. The same character
> is also used in some POS tags. If a flag for frequency data is added to
> the dictionary properties (which is advisable anyway), the problem will
> be solved. In fact, with this flag, we could treat the last byte as the
> frequency data, and the separator between POS tag and frequency would
> not be needed.

I'd go for the flag. It makes the solution more robust. Anyway, the
spelling dictionary for Polish is different from the tagger dictionary,
so I cannot really see why there would be any problems. There are no
separators in the spelling dictionary at all.

> The other solution is to change the separator...

Why would one change something which does not exist yet? It's not even
introduced in the spelling dictionaries...

Best,
Marcin

> Regards,
> Jaume Ortolà
>
> 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
>
>     On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
>      > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
>      >
>      >     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>      >
>      >      > Look at these wordlists [1]. They are Apache 2.0. The words are
>      >      > classified in 256 ranges.
>      >
>      >      > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>      >
>      >     The German one looks okay. Unless the quality is a serious problem for
>      >     some language, I'd suggest simply using these lists.
>      >
>      > I think the quality won't be a serious problem (even with the
>      > tokenization differences in Catalan). The goal is just to avoid common
>      > words (usually short ones) being hidden by dozens of other uncommon
>      > words in spelling suggestions. So these wordlists seem good enough.
>      >
>      > Now, we need Marcin to say something about how to add this data to the
>      > FSA dictionaries. I guess we just need to add an extra field (with a
>      > separator) after the POS tag.
>
>     Exactly. Also, there is already some code in the speller that sorts the
>     suggestions and could use the frequency to sort them as well, not only
>     the edit distance (in the CandidateData internal class in Speller).
>
>     The dictionary can already contain separators, so we can freely use a
>     second field. Maybe we should have another flag to say that this field
>     is actually used for frequency.
>
>     Note: in 1.8, we changed the dictionary building process because there
>     was a tiny bug. I am unable to work on existing dictionaries (some are
>     faulty, for example, the Slovak one) until next week, but then we'll
>     upgrade to morfologik-stemming 1.8 to remove the bugs and have somewhat
>     better dictionaries. Interestingly, infix encoding turns out to be
>     suboptimal right now in terms of the automaton size, but prefix is
>     very good.
>
>     Sorry for being terse, I have an important event (habilitation exam)
>     tomorrow, so I cannot write more at the moment.
>
>     Best,
>     Marcin
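
For illustration only, the two ideas discussed in this thread (sorting suggestions by edit distance and then by frequency, and reading the frequency from the last byte of an entry when a flag says it is present) could be sketched roughly as below. This is a sketch under assumptions, not the actual Morfologik code: the Candidate class and the rank and frequencyOf methods are hypothetical names (Candidate is not the CandidateData class in Speller), the raw-byte layout is assumed, and the sample distances and frequency classes are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FrequencySortSketch {

    /** Hypothetical stand-in for a spelling suggestion (not Morfologik's CandidateData). */
    static final class Candidate {
        final String word;
        final int distance;   // edit distance to the misspelled input
        final int frequency;  // frequency class, 0 (rare) to 255 (very common)

        Candidate(String word, int distance, int frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    /** Smaller edit distance first; among equal distances, higher frequency first. */
    static final Comparator<Candidate> BY_DISTANCE_THEN_FREQUENCY =
        Comparator.comparingInt((Candidate c) -> c.distance)
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.frequency).reversed());

    /** Returns the candidate words in ranked order without modifying the input list. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_DISTANCE_THEN_FREQUENCY);
        List<String> words = new ArrayList<>();
        for (Candidate c : sorted) {
            words.add(c.word);
        }
        return words;
    }

    /**
     * Assumption: when the dictionary is flagged as carrying frequency data,
     * the last byte of an entry is the frequency class, stored as a raw byte.
     */
    static int frequencyOf(byte[] entry) {
        return entry[entry.length - 1] & 0xFF; // mask to read the byte as unsigned 0..255
    }

    public static void main(String[] args) {
        // Invented sample values, for illustration only.
        List<Candidate> candidates = Arrays.asList(
            new Candidate("tether", 3, 60),
            new Candidate("tech", 1, 120),
            new Candidate("the", 1, 250),
            new Candidate("ten", 1, 180));
        System.out.println(rank(candidates)); // prints [the, ten, tech, tether]
    }
}
```

With a comparator like this, the frequency only breaks ties between suggestions at the same edit distance, so a common word is no longer hidden behind uncommon words that are equally close to the input.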

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel