Hi, Jaume and all,

I'm a newbie here, but I have a question: why is there no frequency list
for Galician in
https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries ?

2013/12/9 Jaume Ortolà i Font <jaumeort...@gmail.com>:
> Hi,
>
> I have implemented the use of the frequency word lists, and it works well
> for me. I will make pull requests to both LanguageTool and Morfologik
> projects. In LanguageTool, I added to POSDictionaryBuilder the option of
> reading frequency data from files like ca_wordlist.xml.
>
> Now there are only minor problems in some tests in Polish and German, which
> use '+' as a separator in the FSA dictionary. The same character is used
> also in some POS tags. If a flag for frequency data is added to the
> dictionary properties (which anyway is advisable), the problem will be
> solved. In fact, with this flag, we could consider that the last byte is the
> frequency data and the separator between POS tag and frequency is not
> needed.
>
> The other solution is to change the separator...
>
> Regards,
> Jaume Ortolà
>
>
>
>
> 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
>>
>> On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
>> > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
>> >
>> >     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>> >
>> >      > Look at these wordlists [1]. They are Apache 2.0. The words are
>> >      > classified in 256 ranges.
>> >
>> >      > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>> >
>> >     The German one looks okay. Unless the quality is a serious problem
>> >     for some language, I'd suggest simply using these lists.
>> >
>> >
>> > I think the quality won't be a serious problem (even with the
>> > tokenization differences in Catalan). The goal is just to avoid common
>> > words (usually short ones) being hidden by dozens of other uncommon
>> > words in spelling suggestions. So these wordlists seem good enough.
>> >
>> > Now, we need Marcin to say something about how to add this data to the
>> > FSA dictionaries. I guess we just need to add an extra field (with a
>> > separator) after the POS tag.
>>
>> Exactly. Also, there is already some code in the speller that sorts the
>> suggestions and could use the frequency to sort them as well, not only
>> the edit distance (in the CandidateData internal class in Speller).
>>
>> The dictionary can already contain separators, so we can freely use a
>> second field. Maybe we should have another flag to say that this field
>> is actually used for frequency.
>>
>> Note: in 1.8, we changed the dictionary building process because there
>> was a tiny bug. I cannot work on the existing dictionaries (some are
>> faulty, for example the Slovak one) until next week, but then we'll
>> upgrade to morfologik-stemming 1.8 to remove the bugs and get somewhat
>> better dictionaries. Interestingly, infix encoding currently turns out
>> to be suboptimal in terms of automaton size, whereas prefix encoding is
>> very good.
>>
>> Sorry for being terse, I have an important event (habilitation exam)
>> tomorrow, so I cannot write more at the moment.
>>
>> Best,
>> Marcin
>>
>>
>> _______________________________________________
>> Languagetool-devel mailing list
>> Languagetool-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
>
>
>
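
To make the two ideas quoted above concrete (frequency stored as the last
byte after the POS tag, and suggestions sorted by edit distance and then by
frequency), here is a rough sketch in Java. It is only an illustration under
my own assumptions: the class and method names are hypothetical, this is not
the actual LanguageTool or Morfologik code, and I assume that a higher value
in the 0..255 range means a more common word.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from LanguageTool or Morfologik. */
final class FrequencySketch {

    /** A spelling candidate: the word, its edit distance to the typo, its frequency class. */
    static final class Candidate {
        final String word;
        final int distance;
        final int frequency; // assumed range 0..255, higher = more common

        Candidate(String word, int distance, int frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    /** Appends the frequency class as one trailing byte, with no separator before it. */
    static byte[] appendFrequency(byte[] tagBytes, int frequency) {
        if (frequency < 0 || frequency > 255) {
            throw new IllegalArgumentException("frequency must be in 0..255");
        }
        byte[] entry = Arrays.copyOf(tagBytes, tagBytes.length + 1);
        entry[entry.length - 1] = (byte) frequency;
        return entry;
    }

    /** Reads the trailing byte back as an unsigned value. */
    static int readFrequency(byte[] entry) {
        return entry[entry.length - 1] & 0xFF;
    }

    /** Returns the tag bytes without the trailing frequency byte. */
    static byte[] stripFrequency(byte[] entry) {
        return Arrays.copyOf(entry, entry.length - 1);
    }

    /** Sorts by ascending edit distance, then by descending frequency. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Candidate c) -> c.distance)
                .thenComparing(Comparator.comparingInt((Candidate c) -> c.frequency).reversed()));
        List<String> words = new ArrayList<>();
        for (Candidate c : sorted) {
            words.add(c.word);
        }
        return words;
    }
}
```

Because the frequency is always the final byte, a tag that itself contains
'+' (such as a made-up "adj+sg") can be recovered unchanged with
stripFrequency, which is the point of having a flag instead of a second
separator.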



-- 
Antón Méixome - Galician Native Lang Coordination
Galician community LibO & AOO

