The corpus data does not have to be free, as long as you don't reproduce
the data, do your own processing, and add intelligence.

Anyway, I collected rather reliable frequency data from the Internet for
many languages. I am willing to 'free' the count data under some reasonable
conditions.

These counts were created by counting only words that appear in a
context of 2 preceding correct words and 2 following correct words.

The word in the middle does not necessarily have to be correct.
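
To make the rule concrete, here is a minimal sketch in Python. It is not the script actually used for the counts; the function name, the `known_words` set, and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter


def count_in_trusted_context(tokens, known_words, window=2):
    """Count a token only if the `window` tokens before it and the
    `window` tokens after it are all in `known_words`.

    The counted token itself is NOT checked against `known_words`.
    """
    counts = Counter()
    # Skip positions that do not have a full context on both sides.
    for i in range(window, len(tokens) - window):
        before = tokens[i - window:i]
        after = tokens[i + 1:i + 1 + window]
        if all(t in known_words for t in before + after):
            counts[tokens[i]] += 1
    return counts


# Hypothetical toy data: "szat" is a misspelling with a correct context.
known = {"the", "cat", "sat", "on", "mat"}
tokens = "the cat szat on the mat".split()
print(count_in_trusted_context(tokens, known))
```

In this toy input only "szat" is counted: it has two known words on each side, while "on" is skipped because one of its predecessors ("szat") is not a known word.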

Ruud


> 2013/11/25 Daniel Naber <list2...@danielnaber.de>
>
>> On 2013-11-25 11:11, Jaume Ortolà i Font wrote:
>>
>> > -  A method for building the dictionary, assuming that it will be
>> > used only for some languages (backward compatible).
>> > - A way of using the frequency information in the ordering of
>> > suggestions. For example:
>> > new distance = current distance *10 + a number between 0 and 9
>> > (A-K).
>>
>> Sounds good to me. There are lists of word occurrences on the web, maybe
>> we can use them if the license is okay. If not, the process of creating
>> the list should be reproducible, i.e. it should rely on data that's
>> freely available. This might not be so easy, just using Wikipedia
>> doesn't seem appropriate because of its style.
>>
>>
> Look at these wordlists [1]. They are Apache 2.0. The words are classified
> in 256 ranges.
>
> (The Catalan list is more or less OK. But the tokenization is not the same
> as in LT. I can build a better one from other sources, but the corpus data
> is not freely available.)
>
> Regards,
> Jaume Ortolà
>
> [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries.
> ------------------------------------------------------------------------------
> Shape the Mobile Experience: Free Subscription
> Software experts and developers: Be at the forefront of tech innovation.
> Intel(R) Software Adrenaline delivers strategic insight and game-changing
> conversations that shape the rapidly evolving mobile landscape. Sign up
> now.
> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


