Hi,
I have implemented support for the word frequency lists, and it works well
for me. I will open pull requests against both the LanguageTool and
Morfologik projects. In LanguageTool, I added an option to
POSDictionaryBuilder for reading frequency data from files like
ca_wordlist.xml.
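
To illustrate the kind of input involved, here is a minimal sketch of
reading such a wordlist. It is written in Python only for brevity and is
not the POSDictionaryBuilder code. The XML layout (a <wordlist> root with
<w f="...">word</w> entries, f being the 0-255 frequency class) is an
assumption about these files, and read_frequencies is a hypothetical helper
name.

```python
import xml.etree.ElementTree as ET

# Assumed layout, not confirmed in this thread:
# <wordlist ...><w f="0..255">word</w>...</wordlist>
SAMPLE = """<wordlist locale="ca">
  <w f="255">de</w>
  <w f="240">la</w>
  <w f="12">desembre</w>
</wordlist>"""


def read_frequencies(xml_text):
    """Return {word: frequency class} for every <w> element with an f attribute."""
    freqs = {}
    for w in ET.fromstring(xml_text).iter("w"):
        if w.text and w.get("f") is not None:
            freqs[w.text.strip()] = int(w.get("f"))
    return freqs


freqs = read_frequencies(SAMPLE)
```

The resulting mapping is what a dictionary builder would consult when it
attaches a frequency to each entry.
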
The only remaining problems are minor failures in some Polish and German
tests. Those dictionaries use '+' as the field separator in the FSA
dictionary, and the same character also appears in some POS tags, which
makes an extra '+'-separated frequency field hard to tell apart from the
tag. Adding a flag for frequency data to the dictionary properties (which
is advisable anyway) would solve the problem. In fact, with this flag we
could treat the last byte as the frequency data, and no separator between
the POS tag and the frequency would be needed.
The other solution is to change the separator...
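
A small sketch of why the flag helps, under stated assumptions: the entry
layout below (lemma, '+', tag, then the frequency) is a simplified,
hypothetical one and not Morfologik's real on-disk encoding, and encode and
decode_last_byte are made-up helper names. Python is used only for brevity.

```python
SEP = b"+"


def encode(lemma, tag, freq, with_separator):
    """Build the data part of a hypothetical entry: lemma+tag, then the frequency."""
    if not 0 <= freq <= 255:
        raise ValueError("frequency class must fit in one byte")
    body = lemma.encode("utf-8") + SEP + tag.encode("utf-8")
    return body + (SEP if with_separator else b"") + bytes([freq])


def decode_last_byte(data):
    """Flag-based reading: the final byte is the frequency, the rest is lemma+tag."""
    body, freq = data[:-1], data[-1]
    lemma, tag = body.split(SEP, 1)  # split only once, so '+' inside the tag survives
    return lemma.decode("utf-8"), tag.decode("utf-8"), freq


# Without a flag, these bytes could mean tag "A" plus frequency 66 (ord("B")),
# or tag "A+B" with no frequency at all: the reader cannot tell.
ambiguous = encode("kot", "A", ord("B"), with_separator=True)

# With the flag set, the last byte is always the frequency, so a tag
# containing '+' round-trips without any extra separator.
lemma, tag, freq = decode_last_byte(encode("kot", "A+B", 200, with_separator=False))
```

Changing the separator instead would remove the clash with the POS tags,
but a reader would still need some way to know whether a frequency field is
present.
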
Regards,
Jaume Ortolà
2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
> On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
> > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
> >
> > On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
> >
> > > Look at these wordlists [1]. They are Apache 2.0. The words are
> > > classified in 256 ranges.
> >
> > > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
> >
> > The German one looks okay. Unless the quality is a serious problem for
> > some language, I'd suggest simply using these lists.
> >
> >
> > I think the quality won't be a serious problem (even with the
> > tokenization differences in Catalan). The goal is just to avoid common
> > words (usually short ones) being hidden by dozens of other uncommon
> > words in spelling suggestions. So these wordlists seem good enough.
> >
> > Now, we need Marcin to say something about how to add this data to the
> > FSA dictionaries. I guess we just need to add an extra field (with a
> > separator) after the POS tag.
>
> Exactly. Also, there is already some code in the speller that sorts the
> suggestions and could use the frequency to sort them as well, not only
> the edit distance (in the CandidateData internal class in Speller).
>
> The dictionary can already contain separators, so we can freely use a
> second field. Maybe we should have another flag to say that this field
> is actually used for frequency.
>
> Note: in 1.8, we changed the dictionary building process because there
> was a tiny bug. I am unable to work on the existing dictionaries (some are
> faulty, for example the Slovak one) until next week, but then we'll
> upgrade to morfologik-stemming 1.8 to remove the bugs and get somewhat
> better dictionaries. Interestingly, infix encoding currently turns out to
> be suboptimal in terms of automaton size, while prefix encoding is very
> good.
>
> Sorry for being terse, I have an important event (habilitation exam)
> tomorrow, so I cannot write more at the moment.
>
> Best,
> Marcin
>
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>