Re: switching from Hunspell to Morfologik

Dawid Weiss Mon, 13 Oct 2014 14:26:53 -0700

Hi Jan,

To be honest I'm not really familiar with LT's code either, so I'm not
sure what the dictionary wrappers in LT are actually doing ;) I just
chipped in because I'm familiar with morfologik-stemming, so I tested
your dictionary and provided my feedback.


> This encoder expands suffixes, similar to Hunspell, with which I am slightly 
> more familiar, right?

Morfologik-stemming is essentially a library for constructing and
traversing finite state automata. It doesn't understand any "encodings"
at the lower level, because at that level there are just strings of
characters (or bytes). If you're not a programmer this won't tell you
much, I understand.

Anyway, the "encodings" are built on top of the low-level automata and
are meant to minimize automaton size when pairs or triples of related
sequences are to be stored as a single sequence in the automaton [1].

So something like a pair:

donkeys donkey

will be first "encoded" into:

donkeys+B

which means: to get the stem of the word "donkeys", you need to remove
'B' - 'A' = 1 character from the tail of the string.

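To make the suffix scheme concrete, here is a rough sketch in Python. It
is not Morfologik's actual code: the function names are made up for
illustration, and the "+" separator and the letter arithmetic simply
follow the example above.

```python
SEPARATOR = "+"  # assumed separator, as in the examples in this thread


def encode_suffix(inflected: str, stem: str) -> str:
    """Encode (inflected, stem) as inflected + SEPARATOR + count letter + new ending."""
    # Length of the prefix shared by the inflected form and the stem.
    common = 0
    for a, b in zip(inflected, stem):
        if a != b:
            break
        common += 1
    # Number of trailing characters to drop from the inflected form.
    strip = len(inflected) - common
    # 'A' means strip 0 characters, 'B' means strip 1, and so on.
    return inflected + SEPARATOR + chr(ord("A") + strip) + stem[common:]


def decode_suffix(entry: str) -> str:
    """Recover the stem from an entry produced by encode_suffix."""
    inflected, _, code = entry.partition(SEPARATOR)
    if not code:
        raise ValueError("no encoded part after the separator: %r" % entry)
    strip = ord(code[0]) - ord("A")
    if strip < 0 or strip > len(inflected):
        raise ValueError("cannot strip %d characters from %r" % (strip, inflected))
    # Drop the trailing characters, then append the stored ending (if any).
    return inflected[: len(inflected) - strip] + code[1:]


print(encode_suffix("donkeys", "donkey"))
print(decode_suffix("donkeys+B"))
```

The point of the count letter is that many word forms end up sharing the
same short tail after the separator, which is what lets the automaton
stay small.
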
There are also infix and prefix encoders there, where matters are
slightly more complicated. But like I said -- this is just sugar on
top of the cake; in the end there's always a simple byte sequence
which gets stored in the FSA, the "compression" and its interpretation
are separate things. I think LT folks have agreed that the frequency
data should be stored as:

myword+X

where "X" is the frequency tag. This looks like the encoding above and
can be compressed into an automaton, but it is not the same thing. And
obviously, if you try to decode a sequence from an automaton that has
nothing to do with the "encoding" the decoder knows about, things will
break -- that's what you experienced, unfortunately.
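
The mismatch can be illustrated with a small Python sketch (hypothetical
names, not the library's code). It reads the character after the
separator the way a suffix decoder would, as a count of characters to
strip, and checks whether that count even fits the word. The entry AAA+I
is the one from the dictionary dump quoted further down in this thread;
the "+" separator is assumed from the examples here.

```python
SEPARATOR = "+"  # assumed separator, as in the examples in this thread


def strip_count(entry: str):
    """Interpret the character after the separator as a suffix strip count."""
    word, _, code = entry.partition(SEPARATOR)
    # Suffix-style reading: 'A' = strip 0 characters, 'B' = strip 1, ...
    return word, ord(code[0]) - ord("A")


def fits_suffix_encoding(entry: str) -> bool:
    """True if the strip count does not exceed the length of the word."""
    word, count = strip_count(entry)
    return 0 <= count <= len(word)


for entry in ("donkeys+B", "AAA+I"):
    word, count = strip_count(entry)
    verdict = "ok" if fits_suffix_encoding(entry) else "impossible"
    print(entry, "-> strip", count, "of", len(word), "characters:", verdict)
```

A frequency tag that happens to be a small letter would pass this check
and silently yield a truncated word, so the check only catches the
obvious cases; the real fix is to declare the right encoder for the data.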

I'm sure LT folks will follow up on how to prepare those frequency
dictionaries correctly ;)

Dawid

[1] These encoding schemes were introduced by Jan Daciuk in his tool fsa, here:
http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html

On Mon, Oct 13, 2014 at 9:31 PM, Jan Schreiber
<jan.schrei...@languagetool.org> wrote:
> Dawid,
>
> thanks a ton for the clarifications. TBH I blindly followed the
> instructions on the Wiki page you mentioned and I wasn't really sure
> what I was doing. Please let's make sure I don't misunderstand you.
>
>> with SUFFIX encoder (which your .info file implicitly picks)
> This encoder expands suffixes, similar to Hunspell, with which I am
> slightly more familiar, right? The word list I used is just a plain list
> with no affix compression whatsoever, hence this feature should
> definitely be turned off. I wasn't even fully aware it existed.
>
> One of the problems here was that I couldn't test the binary dict file
> to see if it worked. Is there a way to use Morfologik for non-programmers?
>
> --Jan
>
>
> On 13.10.2014 20:29, Dawid Weiss wrote:
>> This is a valid FSA file, but not a valid encoding for the dictionary
>> you're trying to dump, Jan. That's why you're getting an exception.
>> For example this entry:
>>
>> AAA+I
>>
>> With the SUFFIX encoder (which your .info file implicitly picks), this
>> means truncating 8 bytes from the sequence, which is clearly wrong.
>> It seems to me that you have data that shouldn't be encoded with
>> anything (and isn't) -- perhaps the LT colleagues can follow-up with
>> this one. The wiki page at:
>>
>> http://wiki.languagetool.org/hunspell-support
>>
>> indeed should clarify the encoder property for the associated .info file as:
>>
>> fsa.dict.encoder=NONE
>>
>> If you comment out these obsolete properties in your .info file:
>>
>> #fsa.dict.uses-prefixes=false
>> #fsa.dict.uses-infixes=false
>>
>> and add the above one, the dictionary dumps just fine. In any case,
>> you can always dump *any* FSA dictionary without applying the decoding
>> routines; just use:
>>
>> java -jar morfologik-tools-1.10.0-SNAPSHOT-standalone.jar fsa_dump -d <dict> --raw-data
>>
>> If you do want to "decode" the data, pass an additional "-x", although
>> if the underlying data doesn't make sense, exceptions may occur (no
>> runtime checks are done to verify sanity for performance reasons).
>>
>> Dawid
>>
>> On Mon, Oct 13, 2014 at 3:59 PM, Jan Schreiber
>> <jan.schrei...@languagetool.org> wrote:
>>> In case anyone's interested in the exported plain text file, it is here:
>>> http://sourceforge.net/projects/germandict/files/Morfologik/de_frequency.7z
>>>
>>> I sorted the words by frequency class and additionally sorted the
>>> largest "A" class of least frequent words by word length.
>>>
>>> The frequency distribution for the first 200,000 words looks fairly
>>> plausible, but the vast majority (about 1.4 million word forms) is
>>> lumped together in one huge class.
>>>
>>> Ruud, you said you have larger frequency data sets available for most of
>>> the languages. If you happen to have data for German available I would
>>> love to have it, ideally in the gaia format so I don't have to hassle
>>> with converting it. But a tab-separated list or something like that
>>> would also be great.
>>>
>>> --Jan
>>>
>>> On 12.10.2014 18:18, Jan Schreiber wrote:
>>>> I figured out how to dump the dictionary. All I had to do was create a
>>>> hunspell subfolder and move the binary dictionary into it, then the
>>>> exporting process worked as advertised.
>>>
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> Languagetool-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>
>
