Dawid,

thanks a ton for the clarifications. TBH, I blindly followed the
instructions on the wiki page you mentioned and wasn't really sure
what I was doing. Let me make sure I understand you correctly.

> with SUFFIX encoder (which your .info file implicitly picks)
This encoder expands suffixes, similar to Hunspell, with which I am
slightly more familiar, right? The word list I used is just a plain list
with no affix compression whatsoever, hence this feature should
definitely be turned off. I wasn't even fully aware it existed.
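
To check my understanding, here is a minimal Python sketch of what I
assume the SUFFIX decoding does, based only on your description: the
first byte after the separator says how many bytes to cut from the end
of the word ('A' = 0, 'B' = 1, ...), and the rest is appended. The
function name decode_suffix and the sample entry "cats+B+noun" are my
own inventions, not part of Morfologik, and the real decoder may handle
special cases that this sketch ignores.

```python
def decode_suffix(entry: str, separator: str = "+") -> tuple[str, str, str]:
    """Decode one 'word<sep>K<ending><sep>tag' entry (hypothetical helper).

    K is a single byte; K - 'A' is the number of bytes to cut from the
    end of the word before appending <ending>. This is an assumption
    derived from the description in this thread, not Morfologik's code.
    """
    parts = entry.encode("utf-8").split(separator.encode("utf-8"), 2)
    if len(parts) < 2 or not parts[1]:
        raise ValueError("entry has no encoded base form")
    word, encoded = parts[0], parts[1]
    tag = parts[2] if len(parts) > 2 else b""

    truncate = encoded[0] - ord("A")
    if truncate < 0 or truncate > len(word):
        raise ValueError(
            f"cannot truncate {truncate} bytes from a {len(word)}-byte word"
        )

    # Cut the suffix off the word, then append the stored ending.
    base = word[: len(word) - truncate] + encoded[1:]
    return word.decode("utf-8"), base.decode("utf-8"), tag.decode("utf-8")
```

If that is right, decode_suffix("AAA+I") fails because 'I' stands for
8 bytes while "AAA" has only 3, which would explain the exception I got
on my plain word list.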

One of the problems here was that I couldn't test the binary dict file
to see if it worked. Is there a way for non-programmers to use Morfologik?

--Jan


On 13.10.2014 20:29, Dawid Weiss wrote:
> This is a valid FSA file, but not a valid encoding for the dictionary
> you're trying to dump, Jan. That's why you're getting an exception.
> For example this entry:
> 
> AAA+I
> 
> with SUFFIX encoder (which your .info file implicitly picks) this
> means to truncate 8 bytes from the sequence, which is clearly wrong.
> It seems to me that you have data that shouldn't be encoded with
> anything (and isn't) -- perhaps the LT colleagues can follow-up with
> this one. The wiki page at:
> 
> http://wiki.languagetool.org/hunspell-support
> 
> indeed should clarify the encoder property for the associated .info file as:
> 
> fsa.dict.encoder=NONE
> 
> If you comment out these obsolete properties from your .info file:
> 
> #fsa.dict.uses-prefixes=false
> #fsa.dict.uses-infixes=false
> 
> and add the above one, the dictionary dumps just fine. In any case,
> you can always dump *any* FSA dictionary without applying the decoding
> routines; just use:
> 
> java -jar morfologik-tools-1.10.0-SNAPSHOT-standalone.jar fsa_dump -d <dict> --raw-data
> 
> If you do want to "decode" the data, pass an additional "-x", although
> if the underlying data doesn't make sense, exceptions may occur (no
> runtime checks are done to verify sanity for performance reasons).
> 
> Dawid
> 
> On Mon, Oct 13, 2014 at 3:59 PM, Jan Schreiber
> <jan.schrei...@languagetool.org> wrote:
>> In case anyone's interested in the exported plain text file, it is here:
>> http://sourceforge.net/projects/germandict/files/Morfologik/de_frequency.7z
>>
>> I sorted the words by frequency class and additionally sorted the
>> largest "A" class of least frequent words by word length.
>>
>> The frequency distribution for the first 200,000 words looks fairly
>> plausible, but the vast majority (about 1.4 million word forms) is
>> lumped together in one huge class.
>>
>> Ruud, you said you have larger frequency data sets available for most of
>> the languages. If you happen to have data for German available I would
>> love to have it, ideally in the gaia format so I don't have to hassle
>> with converting it. But a tab-separated list or something like that
>> would also be great.
>>
>> --Jan
>>
>>> On 12.10.2014 18:18, Jan Schreiber wrote:
>>> I figured out how to dump the dictionary. All I had to do was create a
>>> hunspell subfolder and move the binary dictionary into it, then the
>>> exporting process worked as advertised.
>>
>> _______________________________________________
>> Languagetool-devel mailing list
>> Languagetool-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
