I would appreciate some help with frequencies for the spelling dictionary, please.

So I created uk_wordlist.xml for Ukrainian based on a huge media archive and got about 1.5 million words with frequencies. Assuming I don't need that many for spelling, I cut it down to 100k words and fed it to org.languagetool.dev.SpellDictionaryBuilder.

Now my spelling word list contains ~1.6 million words. When I run fsa manually on it I get this:

    FSA implementation     : morfologik.fsa.FSA5
    Compiled with flags    : [FLEXIBLE, STOPBIT, NEXTBIT]
    Number of arcs         : 239614/239614
    Number of nodes        : 87269
    Number of final states : 1623621

which looks right, but when I run this list plus the frequency file through SpellDictionaryBuilder I get far fewer final states in the output:

    FSA implementation     : morfologik.fsa.CFSA2
    Compiled with flags    : [FLEXIBLE, STOPBIT, NEXTBIT]
    Number of arcs         : 249742/249742
    Number of nodes        : 126723
    Number of final states : 1013129

So it looks like I get fewer words in the output (1013129 instead of 1623621), or am I reading it wrong?

Thanks
Andriy

2013-11-26 9:27 GMT-05:00 Jaume Ortolà i Font <jaumeort...@gmail.com>:

> 2013/11/25 Daniel Naber <list2...@danielnaber.de>
>>
>> On 2013-11-25 11:11, Jaume Ortolà i Font wrote:
>>
>> > - A method for building the dictionary, assuming that it will be
>> >   used only for some languages (backward compatible).
>> > - A way of using the frequency information in the ordering of
>> >   suggestions. For example:
>> >   new distance = current distance *10 + a number between 0 and 9
>> >   (A-K).
>>
>> Sounds good to me. There are lists of word occurrences on the web, maybe
>> we can use them if the license is okay. If not, the process of creating
>> the list should be reproducible, i.e. it should rely on data that's
>> freely available. This might not be so easy, just using Wikipedia
>> doesn't seem appropriate because of its style.
>>
>
> Look at these wordlists [1]. They are Apache 2.0. The words are classified
> in 256 ranges.
>
> (The Catalan list is more or less OK. But the tokenization is not the same
> as in LT.
> I can build a better one from other sources, but the corpus data
> is not freely available.)
>
> Regards,
> Jaume Ortolà
>
> [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries.
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
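
For reference, the ordering formula quoted in the thread ("new distance = current distance *10 + a number between 0 and 9") could be sketched roughly as follows. This is only an illustration, not LanguageTool or morfologik code: the function names are hypothetical, and it assumes that frequency class 0 means "most frequent", so that among candidates at the same edit distance the more frequent word sorts first. The thread does not say which direction the scale runs, nor how the letters A-K map onto 0-9.

```python
from typing import List, Tuple

# (word, edit_distance, frequency_class); all three fields are illustrative.
Candidate = Tuple[str, int, int]


def combined_distance(edit_distance: int, freq_class: int) -> int:
    """Apply the quoted formula: current distance * 10 + a number in 0..9.

    Assumption (not stated in the thread): freq_class 0 is the most
    frequent class and 9 the least frequent.
    """
    if not 0 <= freq_class <= 9:
        raise ValueError("freq_class must be between 0 and 9")
    return edit_distance * 10 + freq_class


def rank_suggestions(candidates: List[Candidate]) -> List[str]:
    """Order suggestions by combined distance, smallest first.

    Because the edit distance is multiplied by 10, frequency only breaks
    ties between candidates that share the same edit distance.
    """
    ordered = sorted(candidates, key=lambda c: combined_distance(c[1], c[2]))
    return [word for word, _, _ in ordered]


if __name__ == "__main__":
    sample = [("cat", 1, 5), ("cot", 1, 2), ("coat", 2, 0)]
    print(rank_suggestions(sample))
```

With this scheme a very frequent word at distance 2 still ranks below any word at distance 1, which keeps the existing distance ordering intact and uses frequency only as a tie-breaker.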