Dawid, thanks a ton for the clarifications. TBH I blindly followed the instructions on the wiki page you mentioned and wasn't really sure what I was doing. Let me make sure I don't misunderstand you.

> with SUFFIX encoder (which your .info file implicitly picks)

So this encoder expands suffixes, similar to what Hunspell does (which I
am slightly more familiar with), right? The word list I used is just a
plain list with no affix compression whatsoever, so this feature should
definitely be turned off. I wasn't even fully aware it existed.

One of the problems here was that I couldn't test the binary dict file
to see if it worked. Is there a way to use Morfologik for non-programmers?

--Jan

On 13.10.2014 20:29, Dawid Weiss wrote:
> This is a valid FSA file, but not a valid encoding for the dictionary
> you're trying to dump, Jan. That's why you're getting an exception.
> For example this entry:
>
> AAA+I
>
> with SUFFIX encoder (which your .info file implicitly picks) this
> means to truncate 8 bytes from the sequence, which is clearly wrong.
> It seems to me that you have data that shouldn't be encoded with
> anything (and isn't) -- perhaps the LT colleagues can follow-up with
> this one. The wiki page at:
>
> http://wiki.languagetool.org/hunspell-support
>
> indeed should clarify the encoder property for the associated .info file as:
>
> fsa.dict.encoder=NONE
>
> if you comment out these obsolete properties from your .info file:
>
> #fsa.dict.uses-prefixes=false
> #fsa.dict.uses-infixes=false
>
> and add the above one, the dictionary dumps just fine. In any case,
> you can always dump *any* FSA dictionary without applying the decoding
> routines; just use:
>
> java -jar morfologik-tools-1.10.0-SNAPSHOT-standalone.jar fsa_dump -d
> <dict> --raw-data
>
> If you do want to "decode" the data, pass an additional "-x", although
> if the underlying data doesn't make sense, exceptions may occur (no
> runtime checks are done to verify sanity for performance reasons).
>
> Dawid
>
> On Mon, Oct 13, 2014 at 3:59 PM, Jan Schreiber
> <jan.schrei...@languagetool.org> wrote:
>> In case anyone's interested in the exported plain text file, it is here:
>> http://sourceforge.net/projects/germandict/files/Morfologik/de_frequency.7z
>>
>> I sorted the words by frequency class and additionally sorted the
>> largest "A" class of least frequent words by word length.
>>
>> The frequency distribution for the first 200,000 words looks fairly
>> plausible, but the vast majority (about 1.4 million word forms) is
>> lumped together in one huge class.
>>
>> Ruud, you said you have larger frequency data sets available for most of
>> the languages. If you happen to have data for German available I would
>> love to have it, ideally in the gaia format so I don't have to hassle
>> with converting it. But a tab-separated list or something like that
>> would also be great.
>>
>> --Jan
>>
>> On 12.10.2014 18:18, Jan Schreiber wrote:
>>> I figured out how to dump the dictionary. All I had to do was create a
>>> hunspell subfolder and move the binary dictionary into it, then the
>>> exporting process worked as advertised.
>>
>> _______________________________________________
>> Languagetool-devel mailing list
>> Languagetool-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
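
For reference, the failure Dawid describes for the entry AAA+I can be illustrated with a small sketch. It is not Morfologik's code. It assumes a simplified suffix layout in which the byte right after the "+" separator, minus 'A', gives the number of bytes to strip from the end of the word, and the rest is appended. The function names decode_suffix and read_raw are made up for this illustration.

```python
from typing import Tuple

SEPARATOR = b"+"  # separator assumed from the "AAA+I" example in this thread


def decode_suffix(entry: bytes) -> Tuple[bytes, bytes]:
    """Decode 'word+K<suffix>' where K - 'A' = bytes to strip (assumed layout)."""
    word, sep, rest = entry.partition(SEPARATOR)
    if not sep or not rest:
        raise ValueError("entry has no separator or no payload")
    strip = rest[0] - ord("A")  # b"I"[0] - ord("A") == 8
    if strip < 0 or strip > len(word):
        raise ValueError(
            f"cannot strip {strip} bytes from {len(word)}-byte word {word!r}"
        )
    # Slice by explicit length so that strip == 0 keeps the whole word.
    base = word[: len(word) - strip] + rest[1:]
    return word, base


def read_raw(entry: bytes) -> Tuple[bytes, bytes]:
    """Split 'word+data' without any decoding, as with an unencoded dictionary."""
    word, _, data = entry.partition(SEPARATOR)
    return word, data


if __name__ == "__main__":
    print(read_raw(b"AAA+I"))  # word plus its frequency class, untouched
    try:
        decode_suffix(b"AAA+I")
    except ValueError as err:
        print("suffix decoding fails:", err)
```

Read this way, the "I" in AAA+I is just the frequency class of the word. Only when it is interpreted as a suffix code does it turn into "strip 8 bytes from a 3-byte word", which is why the entry should be stored and dumped without any encoder.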