The corpus data does not have to be free, as long as you don't reproduce the data itself but do your own processing and add intelligence.
Anyway, I have collected fairly reliable frequency data from the Internet for many languages. I am willing to 'free' the count data under some reasonable conditions. The counts were created by counting only words that occur in a context of 2 preceding and 2 following correct words. The word in the middle does not necessarily have to be correct.

Ruud

> 2013/11/25 Daniel Naber <list2...@danielnaber.de>
>
>> On 2013-11-25 11:11, Jaume Ortolà i Font wrote:
>>
>> > - A method for building the dictionary, assuming that it will be
>> >   used only for some languages (backward compatible).
>> > - A way of using the frequency information in the ordering of
>> >   suggestions. For example:
>> >   new distance = current distance *10 + a number between 0 and 9
>> >   (A-K).
>>
>> Sounds good to me. There are lists of word occurrences on the web, maybe
>> we can use them if the license is okay. If not, the process of creating
>> the list should be reproducible, i.e. it should rely on data that's
>> freely available. This might not be so easy, just using Wikipedia
>> doesn't seem appropriate because of its style.
>>
>
> Look at these wordlists [1]. They are Apache 2.0. The words are classified
> in 256 ranges.
>
> (The Catalan list is more or less OK. But the tokenization is not the same
> as in LT. I can build a better one from other sources, but the corpus data
> is not freely available.)
>
> Regards,
> Jaume Ortolà
>
> [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries.
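
To make the counting rule from the reply concrete (count a word only when its 2 preceding and 2 following words are correct, without checking the word itself), here is a minimal sketch. The function name `count_in_trusted_context`, the `known_words` set, the whitespace tokenization and the sample sentence are all assumptions made for illustration; they are not the process actually used to build the data.

```python
from collections import Counter


def count_in_trusted_context(tokens, known_words, window=2):
    """Count tokens whose `window` neighbours on each side are all known words.

    The counted token itself is NOT checked against `known_words`, so
    misspellings and unknown words are counted too, as long as the words
    around them look correct. Tokens too close to either end of the list
    lack a full context and are skipped.
    """
    counts = Counter()
    for i in range(window, len(tokens) - window):
        before = tokens[i - window:i]
        after = tokens[i + 1:i + 1 + window]
        if all(word in known_words for word in before + after):
            counts[tokens[i]] += 1
    return counts


# Hypothetical toy data: a tiny "dictionary" and a whitespace-tokenized text.
known = {"the", "cat", "sat", "on", "mat", "a"}
text = "the cat satt on the mat the xyzzy sat on a mat".split()

counts = count_in_trusted_context(text, known)
for word, n in counts.items():
    print(word, n)
```

In this toy input only `satt` and `xyzzy` are counted: they are the only tokens surrounded by two known words on each side, even though neither is itself in the dictionary.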
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
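
The ordering rule quoted in the thread, "new distance = current distance *10 + a number between 0 and 9", can be sketched as follows. This is only an illustration under stated assumptions: that 0 stands for the most frequent class (so frequent words sort first among candidates with the same distance), and that the names `combined_distance` and `order_suggestions`, as well as the candidate words and their numbers, are hypothetical. None of it is LanguageTool's actual code.

```python
def combined_distance(edit_distance, frequency_class):
    """Apply: new distance = current distance * 10 + a number between 0 and 9.

    Assumption: frequency_class 0 means "most frequent", 9 "least frequent".
    """
    if not 0 <= frequency_class <= 9:
        raise ValueError("frequency_class must be between 0 and 9")
    return edit_distance * 10 + frequency_class


def order_suggestions(candidates):
    """Sort (word, edit_distance, frequency_class) tuples by combined distance."""
    ranked = sorted(candidates, key=lambda c: combined_distance(c[1], c[2]))
    return [word for word, _, _ in ranked]


# Hypothetical candidates for one misspelling; distances and classes are made up.
candidates = [
    ("their", 1, 0),
    ("thief", 2, 3),
    ("theer", 1, 9),
    ("there", 1, 1),
]

ordered = order_suggestions(candidates)
print(ordered)
```

Because the distance is multiplied by 10 and the frequency term never exceeds 9, frequency only breaks ties between candidates at the same distance; a rare word at distance 1 still ranks ahead of a common word at distance 2.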