Hi,
The implementation is now complete.
Frequency data can now be added easily to a spelling or tagging dictionary. In
Catalan, we use the tagging dictionary because there is only one dictionary
for both spelling and tagging. For the other languages, the frequency data
must be attached to the spelling dictionary.
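As a reminder of what the frequency data is for, here is a minimal sketch of how suggestions could be ordered once each word carries a frequency class (the wordlists discussed earlier in this thread use 256 ranges). The Candidate class, the rank method, and the sample values are hypothetical names and numbers made up for illustration; this is not the actual code in Morfologik's Speller, only the idea of sorting by edit distance first and by frequency second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FrequencyRanking {

    /** Hypothetical suggestion: a word, its edit distance, and a frequency class in 0..255. */
    static final class Candidate {
        final String word;
        final int distance;
        final int frequency;

        Candidate(String word, int distance, int frequency) {
            this.word = word;
            this.distance = distance;
            this.frequency = frequency;
        }
    }

    /** Smaller edit distance wins; among equal distances, higher frequency wins. */
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Candidate c) -> c.distance)
                .thenComparing(Comparator.comparingInt((Candidate c) -> c.frequency).reversed()));
        return sorted;
    }

    public static void main(String[] args) {
        // Illustrative values only: distances and frequency classes are invented.
        List<Candidate> suggestions = Arrays.asList(
                new Candidate("tech", 1, 120),
                new Candidate("teeth", 2, 150),
                new Candidate("the", 1, 255),
                new Candidate("ten", 1, 200));
        for (Candidate c : rank(suggestions)) {
            System.out.println(c.word + " (distance " + c.distance + ", frequency " + c.frequency + ")");
        }
        // Order printed: the, ten, tech, teeth
    }
}
```

With this ordering a very common word is no longer buried under rarer words that happen to be at the same edit distance.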
To build the American English spelling dictionary, for example, add this line
to the en_US.info file:
fsa.dict.frequency-included=true
And then run:
java -cp languagetool.jar org.languagetool.dev.SpellDictionaryBuilder en-US dictionary.dump en_US.info en_us_wordlist.xml
This can be done for most languages, but some of the tests will need
adjusting.
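If it helps when adjusting the tests, here is a small sketch of how one could check whether an .info file has the flag set. It assumes the .info file uses Java properties syntax and that a missing key should be read as false; both are my assumptions here. The class and method names are made up for the example, and the other two keys in the sample text are only there to make it look like a typical file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

final class FrequencyFlagCheck {

    /** The key added to the .info file in the example above. */
    static final String FREQUENCY_KEY = "fsa.dict.frequency-included";

    /** Returns true only if the flag is present and set to "true" (assumed default: false). */
    static boolean frequencyIncluded(Reader infoFile) throws IOException {
        Properties props = new Properties();
        props.load(infoFile);
        return Boolean.parseBoolean(props.getProperty(FREQUENCY_KEY, "false"));
    }

    public static void main(String[] args) throws IOException {
        // Sample content for illustration; a real check would read en_US.info from disk.
        String info = "fsa.dict.separator=+\n"
                + "fsa.dict.encoding=utf-8\n"
                + "fsa.dict.frequency-included=true\n";
        System.out.println(frequencyIncluded(new StringReader(info)));
    }
}
```

To run it against a real file, pass a java.io.FileReader for the .info file instead of the StringReader.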
Regards,
Jaume Ortolà
2013/12/9 Jaume Ortolà i Font <jaumeort...@gmail.com>
> Hi,
>
> I have implemented the flag for frequency data.
>
> There is a pull request in Morfologik. In LanguageTool I have published a
> branch (which was probably unnecessary, I'm afraid). Take a look. Perhaps
> there are things you would prefer to code differently.
>
> Now we need to merge the code into LT and get a new version of Morfologik.
> Then we'll be able to rebuild the dictionaries and adjust the tests if
> needed.
>
> Regards,
> Jaume Ortolà
>
>
>
>
> 2013/12/9 Marcin Miłkowski <list-addr...@wp.pl>
>
>> On 2013-12-09 00:12, Jaume Ortolà i Font wrote:
>> > Hi,
>> >
>> > I have implemented the use of the frequency word lists, and it works
>> > well for me. I will make pull requests to both LanguageTool and
>> > Morfologik projects. In LanguageTool, I added to POSDictionaryBuilder
>> > the option of reading frequency data from files like ca_wordlist.xml.
>>
>> Excellent!
>>
>> > Now there are only minor problems in some tests in Polish and German,
>> > which use '+' as a separator in the FSA dictionary. The same character
>> > is also used in some POS tags. If a flag for frequency data is added to
>> > the dictionary properties (which is advisable anyway), the problem will
>> > be solved. In fact, with this flag, we could treat the last byte as the
>> > frequency data, and the separator between POS tag and frequency would
>> > not be needed.
>>
>> I'd go for the flag. It makes the solution more robust.
>>
>> Anyway, the spelling dictionary for Polish is different from the tagger
>> dictionary so I cannot really see why there could be any problems? There
>> are no separators in the spelling dictionary at all.
>>
>> >
>> > The other solution is to change the separator...
>>
>> Why would one change something which does not exist yet? It's not even
>> introduced in the spelling dictionaries...
>>
>> Best,
>> Marcin
>>
>> >
>> > Regards,
>> > Jaume Ortolà
>> >
>> >
>> >
>> >
>> > 2013/11/26 Marcin Miłkowski <list-addr...@wp.pl>
>> >
>> > On 2013-11-26 18:44, Jaume Ortolà i Font wrote:
>> > > 2013/11/26 Daniel Naber <list2...@danielnaber.de>
>> > >
>> > > On 2013-11-26 15:27, Jaume Ortolà i Font wrote:
>> > >
>> > > > Look at these wordlists [1]. They are Apache 2.0. The words are
>> > > > classified in 256 ranges.
>> > >
>> > > > [1] https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries
>> > >
>> > > The German one looks okay. Unless the quality is a serious problem for
>> > > some language, I'd suggest simply using these lists.
>> > >
>> > >
>> > > I think the quality won't be a serious problem (even with the
>> > > tokenization differences in Catalan). The goal is just to avoid common
>> > > words (usually short ones) being hidden by dozens of other uncommon
>> > > words in spelling suggestions. So these wordlists seem good enough.
>> > >
>> > > Now, we need Marcin to say something about how to add this data to the
>> > > FSA dictionaries. I guess we just need to add an extra field (with a
>> > > separator) after the POS tag.
>> >
>> > Exactly. Also, there is already some code in the speller that sorts the
>> > suggestions and could use the frequency to sort them as well, not only
>> > the edit distance (in the CandidateData internal class in Speller).
>> >
>> > The dictionary can already contain separators, so we can freely use a
>> > second field. Maybe we should have another flag to say that this field
>> > is actually used for frequency.
>> >
>> > Note: in 1.8, we changed the dictionary building process because there
>> > was a tiny bug. I am unable to work on existing dictionaries (some are
>> > faulty, for example, the Slovak one) until next week, but then we'll
>> > upgrade to morfologik-stemming 1.8 to remove the bugs and have somewhat
>> > better dictionaries. Interestingly, infix encoding turns out to be
>> > suboptimal right now in terms of the automaton size, but prefix is
>> > very good.
>> >
>> > Sorry for being terse, I have an important event (habilitation exam)
>> > tomorrow, so I cannot write more at the moment.
>> >
>> > Best,
>> > Marcin
>> >
>> >
>> > _______________________________________________
>> > Languagetool-devel mailing list
>> > Languagetool-devel@lists.sourceforge.net
>> > <mailto:Languagetool-devel@lists.sourceforge.net>
>> > https://lists.sourceforge.net/lists/listinfo/languagetool-devel