I would appreciate some help with frequencies for a spelling dictionary, please.

So I created uk_wordlist.xml for Ukrainian based on a huge media
archive and got about 1.5 million words with frequencies. Assuming I don't
need that many for spelling, I cut it down to 100k words and fed it to
org.languagetool.dev.SpellDictionaryBuilder.
Now my spelling word list contains ~1.6 million words, and when I run
fsa manually on it I get this:

FSA implementation     : morfologik.fsa.FSA5
Compiled with flags    : [FLEXIBLE, STOPBIT, NEXTBIT]
Number of arcs         : 239614/239614
Number of nodes        : 87269
Number of final states : 1623621

which looks right, but when I run this list + frequency file through
SpellDictionaryBuilder I get far fewer final states in the output:

FSA implementation     : morfologik.fsa.CFSA2
Compiled with flags    : [FLEXIBLE, STOPBIT, NEXTBIT]
Number of arcs         : 249742/249742
Number of nodes        : 126723
Number of final states : 1013129

So it looks like I get fewer words in the output, or am I reading it wrong?
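
For reference, the trimming step mentioned above (cutting the frequency list down to the most frequent words) amounts to something like the following. This is only a minimal Python sketch: `top_n_words` and the sample counts are hypothetical, and reading uk_wordlist.xml is left out entirely because its layout is not shown here.

```python
from collections import Counter


def top_n_words(freqs, n):
    """Keep only the n most frequent words from a word -> count mapping."""
    # Counter.most_common returns (word, count) pairs, highest count first.
    return dict(Counter(freqs).most_common(n))


# Hypothetical sample counts, not taken from the real corpus.
counts = {"і": 500, "на": 320, "що": 410, "рідкісний": 2}
trimmed = top_n_words(counts, 3)
```

With the real data, `n` would be 100000; words dropped here simply have no frequency entry afterwards.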

Thanks
Andriy


2013-11-26 9:27 GMT-05:00 Jaume Ortolà i Font <jaumeort...@gmail.com>:
> 2013/11/25 Daniel Naber <list2...@danielnaber.de>
>>
>> On 2013-11-25 11:11, Jaume Ortolà i Font wrote:
>>
>> > -  A method for building the dictionary, assuming that it will be
>> > used only for some languages (backward compatible).
>> > - A way of using the frequency information in the ordering of
>> > suggestions. For example:
>> > new distance = current distance *10 + a number between 0 and 9
>> > (A-K).
>>
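
The ordering idea quoted above (new distance = current distance * 10 + a number between 0 and 9) can be sketched as below. This is a rough Python illustration, not LanguageTool or morfologik code: the function names and sample candidates are made up, and it assumes that a lower number means a more frequent word, which the proposal does not spell out. How the letter codes map to numbers is also left open here.

```python
def combined_distance(edit_distance, freq_penalty):
    """Fold a 0-9 frequency penalty into an integer edit distance."""
    if not 0 <= freq_penalty <= 9:
        raise ValueError("freq_penalty must be between 0 and 9")
    # Multiplying by 10 keeps the edit distance dominant; the penalty
    # only breaks ties between candidates at the same edit distance.
    return edit_distance * 10 + freq_penalty


def rank_suggestions(candidates):
    """Sort (word, edit_distance, freq_penalty) tuples, best first."""
    return sorted(candidates, key=lambda c: combined_distance(c[1], c[2]))


# Hypothetical candidates: (word, edit distance, frequency penalty).
candidates = [
    ("rare", 1, 9),
    ("common", 1, 0),
    ("far", 2, 0),
]
ranked = [word for word, _, _ in rank_suggestions(candidates)]
```

Because the penalty never exceeds 9, a suggestion at a larger edit distance can never overtake one at a smaller edit distance, however frequent it is.
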
>> Sounds good to me. There are lists of word occurrences on the web, maybe
>> we can use them if the license is okay. If not, the process of creating
>> the list should be reproducible, i.e. it should rely on data that's
>> freely available. This might not be so easy, just using Wikipedia
>> doesn't seem appropriate because of its style.
>>
>
> Look at these wordlists [1]. They are Apache 2.0. The words are classified
> in 256 ranges.
>
> (The Catalan list is more or less OK. But the tokenization is not the same
> as in LT. I can build a better one from other sources, but the corpus data
> is not freely available.)
>
> Regards,
> Jaume Ortolà
>
> [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries.
>

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
