Hi,

I have implemented the flag for frequency data.
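
For illustration only (this is not the code from the pull request): with such a flag in the dictionary properties, the last byte of each entry can be read as the frequency class, so a '+' inside a POS tag is no longer ambiguous. A minimal Java sketch follows. The class name, the method name and the boolean parameter are hypothetical and are not Morfologik API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: not part of Morfologik or LanguageTool.
final class FrequencyFlagSketch {

    /** Decoded entry: the POS tag and a frequency class in the range 0..255. */
    static final class Decoded {
        final String tag;
        final int frequency;

        Decoded(String tag, int frequency) {
            this.tag = tag;
            this.frequency = frequency;
        }
    }

    /**
     * Splits the tag bytes of one entry. If the (assumed) frequency flag is set,
     * the last byte is taken as the frequency and everything before it as the tag.
     * No separator is searched for, so '+' may appear inside the tag.
     */
    static Decoded decode(byte[] data, boolean frequencyIncluded) {
        if (!frequencyIncluded || data.length == 0) {
            // No frequency stored: the whole array is the tag, frequency defaults to 0.
            return new Decoded(new String(data, StandardCharsets.UTF_8), 0);
        }
        int frequency = data[data.length - 1] & 0xFF; // read the byte as unsigned
        String tag = new String(data, 0, data.length - 1, StandardCharsets.UTF_8);
        return new Decoded(tag, frequency);
    }

    public static void main(String[] args) {
        byte[] tagBytes = "verb+inf".getBytes(StandardCharsets.UTF_8);
        byte[] data = Arrays.copyOf(tagBytes, tagBytes.length + 1);
        data[data.length - 1] = (byte) 200; // made-up frequency class
        Decoded d = decode(data, true);
        System.out.println(d.tag + " / " + d.frequency);
    }
}
```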

There is a pull request in Morfologik. In LanguageTool I have published a
branch (which was probably unnecessary, I'm afraid). Take a look; perhaps
there are things you would prefer to code differently.

Now we need to merge the code in LT and bring in a new version of Morfologik.
Then we'll be able to rebuild the dictionaries and adjust the tests if needed.
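
Once the dictionaries carry frequency data, the suggestions can be ranked by it. As a rough sketch of the idea discussed below (frequency in addition to edit distance), here is one possible ordering: smaller edit distance first, and higher frequency first among equal distances. This ordering is an assumption, and the Candidate class here is a hypothetical stand-in, not the actual CandidateData class in Speller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: not the Speller code from Morfologik.
final class SuggestionRankingSketch {

    /** A spelling suggestion with its edit distance and frequency class (0..255). */
    static final class Candidate {
        final String word;
        final int distance;
        final int frequency;

        Candidate(String word, int distance, int frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    /** Closest candidates first; among equally close ones, the most frequent first. */
    static final Comparator<Candidate> BY_DISTANCE_THEN_FREQUENCY =
        Comparator.comparingInt((Candidate c) -> c.distance)
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.frequency).reversed());

    /** Returns the candidate words in ranked order without modifying the input list. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_DISTANCE_THEN_FREQUENCY);
        List<String> words = new ArrayList<>();
        for (Candidate c : sorted) {
            words.add(c.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("rare", 1, 3));     // made-up values
        candidates.add(new Candidate("common", 1, 240));
        candidates.add(new Candidate("far", 2, 255));
        System.out.println(rank(candidates));
    }
}
```

With this ordering a common word is no longer hidden behind uncommon words at the same edit distance, while a more distant word never overtakes a closer one, however frequent it is.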

Regards,
Jaume Ortolà




2013/12/9 Marcin Miłkowski <list-addr...@wp.pl>

> On 2013-12-09 00:12, Jaume Ortolà i Font wrote:
> > Hi,
> >
> > I have implemented the use of the frequency word lists, and it works
> > well for me. I will make pull requests to both LanguageTool and
> > Morfologik projects. In LanguageTool, I added to POSDictionaryBuilder
> > the option of reading frequency data from files like ca_wordlist.xml.
>
> Excellent!
>
> > Now there are only minor problems in some tests in Polish and German,
> > which use '+' as a separator in the FSA dictionary. The same character
> > is used also in some POS tags. If a flag for frequency data is added to
> > the dictionary properties (which anyway is advisable), the problem will
> > be solved. In fact, with this flag, we could consider that the last byte
> > is the frequency data and the separator between POS tag and frequency is
> > not needed.
>
> I'd go for the flag. It makes the solution more robust.
>
> Anyway, the spelling dictionary for Polish is different from the tagger
> dictionary, so I cannot really see why there would be any problems. There
> are no separators in the spelling dictionary at all.
>
> >
> > The other solution is to change the separator...
>
> Why would one change something which does not exist yet? It's not even
> introduced in the spelling dictionaries...
>
> Best,
> Marcin
>
> >
> > Regards,
> > Jaume Ortolà
> >
> >
> >
> >
> > 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
> >
> >     On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
> >      > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
> >      >
> >      >     On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
> >      >
> >      >      > Look at these wordlists [1]. They are Apache 2.0. The words are
> >      >      > classified in 256 ranges.
> >      >
> >      >      > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
> >      >
> >      >     The German one looks okay. Unless the quality is a serious problem
> >      >     for some language, I'd suggest simply using these lists.
> >      >
> >      >
> >      > I think the quality won't be a serious problem (even with the
> >      > tokenization differences in Catalan). The goal is just to avoid common
> >      > words (usually short ones) being hidden by dozens of other uncommon
> >      > words in spelling suggestions. So these wordlists seem good enough.
> >      >
> >      > Now, we need Marcin to say something about how to add this data to the
> >      > FSA dictionaries. I guess we just need to add an extra field (with a
> >      > separator) after the POS tag.
> >
> >     Exactly. Also, there is already some code in the speller that sorts the
> >     suggestions and could use the frequency to sort them as well, not only
> >     the edit distance (in the CandidateData internal class in Speller).
> >
> >     The dictionary can already contain separators, so we can freely use a
> >     second field. Maybe we should have another flag to say that this field
> >     is actually used for frequency.
> >
> >     Note: in 1.8, we changed the dictionary building process because there
> >     was a tiny bug. I am unable to work on existing dictionaries (some are
> >     faulty, for example, the Slovak one) until next week, but then we'll
> >     upgrade to morfologik-stemming 1.8 to remove the bugs and have somewhat
> >     better dictionaries. Interestingly, infix encoding turns out to be
> >     suboptimal right now in terms of the automaton size but prefix is
> >     very good.
> >
> >     Sorry for being terse, I have an important event (habilitation exam)
> >     tomorrow, so I cannot write more at the moment.
> >
> >     Best,
> >     Marcin
> >
> >
> >     _______________________________________________
> >     Languagetool-devel mailing list
> >     Languagetool-devel@lists.sourceforge.net
> >     <mailto:Languagetool-devel@lists.sourceforge.net>
> >     https://lists.sourceforge.net/lists/listinfo/languagetool-devel
> >
>