On 2013-12-09 00:12, Jaume Ortolà i Font wrote:
> Hi,
>
> I have implemented the use of the frequency word lists, and it works
> well for me. I will make pull requests to both LanguageTool and
> Morfologik projects. In LanguageTool, I added to POSDictionaryBuilder
> the option of reading frequency data from files like ca_wordlist.xml.
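
For illustration, reading such a frequency file could look roughly like the sketch below. It assumes the wordlist XML uses <w f="..."> elements whose f attribute holds the frequency class (0 to 255). That layout, the SAMPLE data and the name read_frequencies are assumptions made for this example; it is not the actual POSDictionaryBuilder change, which is Java (Python is used here only for brevity).

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical sample; the real ca_wordlist.xml may differ in detail.
SAMPLE = """<wordlist locale="ca">
  <w f="255" flags="">de</w>
  <w f="180" flags="">casa</w>
  <w f="90" flags="">casa</w>
</wordlist>"""


def read_frequencies(xml_text: str) -> Dict[str, int]:
    """Map each word to its frequency class, keeping the highest value seen."""
    root = ET.fromstring(xml_text)
    frequencies: Dict[str, int] = {}
    for element in root.iter("w"):
        word = (element.text or "").strip()
        if not word:
            continue  # skip empty entries
        value = int(element.get("f", "0"))
        frequencies[word] = max(value, frequencies.get(word, 0))
    return frequencies
```

Keeping the highest value for a repeated word is a design choice of this sketch, not something the thread specifies.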

Excellent!

> Now there are only minor problems in some tests in Polish and German,
> which use '+' as a separator in the FSA dictionary. The same character
> is used also in some POS tags. If a flag for frequency data is added to
> the dictionary properties (which anyway is advisable), the problem will
> be solved. In fact, with this flag, we could consider that the last byte
> is the frequency data and the separator between POS tag and frequency is
> not needed.

I'd go for the flag. It makes the solution more robust.
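
To make the point concrete, here is a minimal sketch of how an entry could be decoded once such a flag exists. It assumes the part of the entry after the inflected form looks like base + separator + tag, optionally followed by one raw frequency byte; the names Entry, decode_entry and has_frequency are hypothetical and not Morfologik API.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    base: bytes
    tag: bytes
    frequency: Optional[int]


def decode_entry(payload: bytes, separator: bytes = b"+",
                 has_frequency: bool = False) -> Entry:
    """Split 'base<sep>tag[freq]' into its parts.

    When has_frequency is set, the last byte is taken as the frequency,
    so no extra separator is needed and '+' may still occur in the tag.
    """
    frequency: Optional[int] = None
    if has_frequency:
        if not payload:
            raise ValueError("empty payload cannot carry a frequency byte")
        payload, frequency = payload[:-1], payload[-1]
    # Split on the first separator only; later ones belong to the tag.
    base, _, tag = payload.partition(separator)
    return Entry(base, tag, frequency)
```

With the flag set, decode_entry(b"kot+subst:sg+nom" + bytes([200]), has_frequency=True) yields the tag b"subst:sg+nom" intact and a frequency of 200.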

Anyway, the spelling dictionary for Polish is different from the tagger
dictionary, so I cannot really see why there would be any problems. There
are no separators in the spelling dictionary at all.

>
> The other solution is to change the separator...

Why would one change something which does not exist yet? It's not even 
introduced in the spelling dictionaries...

Best,
Marcin

>
> Regards,
> Jaume Ortolà
>
>
>
>
> 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
>
>     On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
>      > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
>      >
>      >     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>      >
>      >      > Look at these wordlists [1]. They are Apache 2.0. The words are
>      >      > classified in 256 ranges.
>      >
>      >      > [1]
>      > https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>      >
>      >     The German one looks okay. Unless the quality is a serious problem
>      >     for some language, I'd suggest simply using these lists.
>      >
>      >
>      > I think the quality won't be a serious problem (even with the
>      > tokenization differences in Catalan). The goal is just to avoid common
>      > words (usually short ones) being hidden by dozens of other uncommon
>      > words in spelling suggestions. So these wordlists seem good enough.
>      >
>      > Now, we need Marcin to say something about how to add this data to the
>      > FSA dictionaries. I guess we just need to add an extra field (with a
>      > separator) after the POS tag.
>
>     Exactly. Also, there is already some code in the speller that sorts the
>     suggestions and could use the frequency to sort them as well, not only
>     the edit distance (in the CandidateData internal class in Speller).
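
As an illustration of that idea, suggestions could be ordered by edit distance first and by frequency second, so that among equally close candidates the common word comes first. This is only a sketch: rank_suggestions and its inputs are hypothetical and do not reproduce the actual CandidateData code in the Speller.

```python
from typing import Dict, List, Tuple


def rank_suggestions(candidates: List[Tuple[str, int]],
                     frequency: Dict[str, int]) -> List[str]:
    """Order (word, edit_distance) pairs for display.

    Sort key: smaller edit distance first, then higher frequency class,
    then the word itself so the order is deterministic.
    Words missing from the frequency map are treated as class 0.
    """
    ordered = sorted(
        candidates,
        key=lambda c: (c[1], -frequency.get(c[0], 0), c[0]),
    )
    return [word for word, _ in ordered]
```

For example, with candidates [("ha", 1), ("he", 1), ("hem", 2)] and frequencies {"he": 250, "ha": 120, "hem": 255}, the result is ["he", "ha", "hem"]: frequency only breaks ties between candidates at the same distance.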
>
>     The dictionary can already contain separators, so we can freely use a
>     second field. Maybe we should have another flag to say that this field
>     is actually used for frequency.
>
>     Note: in 1.8, we changed the dictionary building process because there
>     was a tiny bug. I am unable to work on existing dictionaries (some are
>     faulty, for example, the Slovak one) until next week, but then we'll
>     upgrade to morfologik-stemming 1.8 to remove the bugs and have somewhat
>     better dictionaries. Interestingly, infix encoding turns out to be
>     suboptimal right now in terms of the automaton size but prefix is
>     very good.
>
>     Sorry for being terse, I have an important event (habilitation exam)
>     tomorrow, so I cannot write more at the moment.
>
>     Best,
>     Marcin
>
>     
>     _______________________________________________
>     Languagetool-devel mailing list
>     Languagetool-devel@lists.sourceforge.net
>     <mailto:Languagetool-devel@lists.sourceforge.net>
>     https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
>
>
>
>

