On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
> 2013/11/26 Daniel Naber <list2...@danielnaber.de
> <mailto:list2...@danielnaber.de>>
>
>     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>
>      > Look at these wordlists [1]. They are Apache 2.0. The words are
>      > classified in 256 ranges.
>
>      > [1]
>     https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>
>     The German one looks okay. Unless the quality is a serious problem for
>     some language, I'd suggest simply using these lists.
>
>
> I think the quality won't be a serious problem (even with the
> tokenization differences in Catalan). The goal is just to avoid common
> words (usually short ones) being hidden by dozens of other uncommon
> words in spelling suggestions. So these wordlists seem good enough.
>
> Now, we need Marcin to say something about how to add this data to the
> FSA dictionaries. I guess we just need to add an extra field (with a
> separator) after the POS tag.

Exactly. Also, there is already some code in the speller that sorts the 
suggestions and could use the frequency to sort them as well, not only 
the edit distance (in the CandidateData internal class in Speller).
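
A rough sketch of that ordering, not the actual Speller code: the Candidate
class below is a hypothetical stand-in for CandidateData, and the frequency
value is assumed to be a class from 0 (rare) to 255 (very common), matching
the 256 ranges in the wordlists. Edit distance stays the primary key and
frequency only breaks ties.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the speller's internal candidate class.
final class Candidate {
    final String word;
    final int distance;   // edit distance from the misspelled input
    final int frequency;  // assumed frequency class: 0 (rare) .. 255 (common)

    Candidate(String word, int distance, int frequency) {
        this.word = word;
        this.distance = distance;
        this.frequency = frequency;
    }
}

final class SuggestionSorter {
    // Smaller edit distance first; for equal distances, higher frequency first.
    static final Comparator<Candidate> BY_DISTANCE_THEN_FREQUENCY =
        Comparator.comparingInt((Candidate c) -> c.distance)
                  .thenComparing(
                      Comparator.comparingInt((Candidate c) -> c.frequency).reversed());

    // Returns the candidate words in suggestion order without modifying the input.
    static List<String> sortedWords(List<Candidate> candidates) {
        List<Candidate> copy = new ArrayList<>(candidates);
        copy.sort(BY_DISTANCE_THEN_FREQUENCY);
        List<String> words = new ArrayList<>();
        for (Candidate c : copy) {
            words.add(c.word);
        }
        return words;
    }
}
```

With this ordering a common short word at distance 1 is listed ahead of rare
words at the same distance, which is the effect Jaume describes.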

The dictionary can already contain separators, so we can freely use a 
second field. Maybe we should have another flag to say that this field 
is actually used for frequency.
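
To illustrate why a flag would help, here is a minimal sketch under my own
assumptions: the class name, the '+' separator, the decimal frequency value
and the frequencyIncluded flag are all hypothetical and are not part of the
current dictionary format. Since a tag may itself contain the separator,
only the flag tells the reader whether the last field is a frequency.

```java
// Hypothetical holder for a tag field that may carry a trailing frequency.
final class TagWithFrequency {
    final String tag;
    final int frequency; // -1 when the dictionary stores no frequency

    private TagWithFrequency(String tag, int frequency) {
        this.tag = tag;
        this.frequency = frequency;
    }

    // Splits "tag<separator>frequency" only when the (assumed) flag is set;
    // otherwise the whole field is treated as the tag, separators included.
    static TagWithFrequency parse(String field, char separator, boolean frequencyIncluded) {
        if (!frequencyIncluded) {
            return new TagWithFrequency(field, -1);
        }
        int idx = field.lastIndexOf(separator);
        if (idx < 0) {
            throw new IllegalArgumentException("Missing frequency field: " + field);
        }
        int frequency = Integer.parseInt(field.substring(idx + 1));
        return new TagWithFrequency(field.substring(0, idx), frequency);
    }
}
```

For example, parsing "subst:sg+200" with the flag set would give the tag
"subst:sg" and frequency 200, while without the flag the whole string stays
the tag.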

Note: in morfologik-stemming 1.8, we changed the dictionary building
process because there was a tiny bug. I cannot work on the existing
dictionaries (some are faulty, for example the Slovak one) until next
week, but then we'll upgrade to morfologik-stemming 1.8 to remove the bugs
and get somewhat better dictionaries. Interestingly, infix encoding turns
out to be suboptimal right now in terms of automaton size, but prefix
encoding is very good.

Sorry for being terse, I have an important event (habilitation exam) 
tomorrow, so I cannot write more at the moment.

Best,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
