It would be easier if you could set up a small GitHub project and:

1) include a 100-line source file with your data;
2) include the commands you're using to compress this data through the
various pipelines you mentioned.

I suspect SpellDictionaryBuilder preprocesses the input somehow before
it passes it to the fsa; I'd have to take a look at what it does
myself to actually say what's wrong :)
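
In the meantime, one rough way to narrow this down is to compare the
source word list with what fsa_dump prints for the compiled dictionary.
The sketch below is only an illustration under a few assumptions: both
sides are plain text with one entry per line, and the function name
diff_entries is made up for this example (it is not part of morfologik
or LanguageTool).

```python
def diff_entries(source_lines, dumped_lines):
    """Compare two iterables of text lines, ignoring order and blank lines.

    Returns (missing, extra): entries present only in the source,
    and entries present only in the dump.
    """
    source = {line.rstrip("\r\n") for line in source_lines}
    dumped = {line.rstrip("\r\n") for line in dumped_lines}
    source.discard("")
    dumped.discard("")
    missing = sorted(source - dumped)
    extra = sorted(dumped - source)
    return missing, extra


# Hypothetical usage with files on disk (paths are placeholders; cp1251 is
# the encoding reported in the dictionary metadata quoted below):
#
#   with open("source.txt", encoding="cp1251") as src, \
#        open("dump.txt", encoding="cp1251") as dmp:
#       missing, extra = diff_entries(src, dmp)
#       print(len(missing), "entries lost;", len(extra), "entries added")
```

If the builder rewrites entries before compiling them (for example by
attaching frequency data), the raw lines will not match one-to-one, and
the "missing" and "extra" lists should show what kind of rewriting took
place.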

Dawid


On Mon, Dec 22, 2014 at 6:41 PM, Andriy Rysin <ary...@gmail.com> wrote:
> I just double-checked:
> if I encode my dictionary with fsa_build, both cfsa2 and fsa5
> give me ~1.6 million rows
> if I encode using org.languagetool.dev.SpellDictionaryBuilder with
> frequency information I get ~1 million rows
> if I encode using org.languagetool.dev.SpellDictionaryBuilder without
> frequency information I get 7k rows (!)
>
> Here is the head of the output for the ~1 million row dictionary generated
> by SpellDictionaryBuilder:
>
> FSA properties
> --------------
> FSA implementation     : morfologik.fsa.CFSA2
> Compiled with flags    : [FLEXIBLE, STOPBIT, NEXTBIT]
> Number of arcs         : 249742/249742
> Number of nodes        : 126723
> Number of final states : 1013129
>
> Dictionary metadata
> -------------------
> fsa.dict.separator                      : 0x2b ('+')
> fsa.dict.encoding                       : cp1251
> fsa.dict.frequency-included             : true
> fsa.dict.encoder                        : SUFFIX
>
> Thanks
> Andriy
>
> 2014-12-22 12:10 GMT-05:00 Dawid Weiss <dawid.we...@gmail.com>:
>> I honestly don't know -- I'm not into LT development, I know the FST
>> package. :) Try "head" instead of wc -l and see what the output is.
>>
>> Dawid
>>
>> On Mon, Dec 22, 2014 at 2:09 PM, Andriy Rysin <ary...@gmail.com> wrote:
>>> Thanks, so I ran this (f-uk-UA.dict is the file with frequencies):
>>> java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -x -d
>>> f-uk-UA.dict | wc -l
>>> 1013148
>>>
>>> while (uk-UA.dict is the old one without frequencies)
>>> java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -d uk-UA.dict |
>>> wc -l
>>> 1623646
>>>
>>> is this still correct?
>>>
>>> Thanks
>>> Andriy
>>>
>>> 2014-12-22 2:41 GMT-05:00 Dawid Weiss <dawid.we...@gmail.com>:
>>>>> So it looks like I get fewer words in the output, or am I reading it wrong?
>>>>
>>>> You are reading it wrong, Andriy. A final state does not correspond to
>>>> a unique path through the automaton.
>>>>
>>>> AX
>>>> AY
>>>>
>>>> should have two final states (X, Y), whereas:
>>>>
>>>> AX1
>>>> AY1
>>>>
>>>> will only have one final state (1).
>>>>
>>>> All this is a bit more complicated by the fact that morfologik uses
>>>> final bits on arcs rather than actual states... I'd say don't go into
>>>> it unless you really have to. You can always dump the content of the
>>>> created automaton; it should just return the input used to create it
>>>> -- that's how you know the automaton is valid.
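
To make the AX/AY example concrete, here is a toy model of it. This is
a sketch under stated assumptions, not morfologik's actual builder: it
puts the "final" flag on arcs, merges states that have identical
outgoing arcs, and assumes that the reported "final states" figure
corresponds to the number of arcs flagged final. All function names are
invented for this illustration, and input words are assumed non-empty.

```python
def build_minimal_automaton(words):
    """Build a trie with final flags on arcs, then merge equivalent states."""
    # Trie node: {label: [arc_is_final, child_node]}
    root = {}
    for word in words:
        node = root
        for i, ch in enumerate(word):
            arc = node.setdefault(ch, [False, {}])
            if i == len(word) - 1:
                arc[0] = True  # the last character's arc ends a word
            node = arc[1]

    registry = {}  # tuple of outgoing arcs -> state id
    states = {}    # state id -> tuple of (label, is_final, target id)

    def canonical(node):
        # Two states are equivalent when their outgoing arcs are identical.
        arcs = tuple(sorted(
            (ch, final, canonical(child)) for ch, (final, child) in node.items()
        ))
        if arcs not in registry:
            registry[arcs] = len(registry)
            states[registry[arcs]] = arcs
        return registry[arcs]

    return canonical(root), states


def count_final_arcs(states):
    """Number of distinct arcs carrying the final flag."""
    return sum(1 for arcs in states.values() for _, final, _ in arcs if final)


def count_words(start, states):
    """Number of distinct accepted sequences (paths ending on a final arc)."""
    return sum(int(final) + count_words(target, states)
               for _, final, target in states[start])


for words in (["AX", "AY"], ["AX1", "AY1"]):
    start, states = build_minimal_automaton(words)
    print(words, count_final_arcs(states), count_words(start, states))
```

In this model both inputs contain two words, but the first automaton
has two final arcs (X and Y) while the second has only one (the shared
"1"), so the final count says nothing about the number of entries.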
>>>>
>>>> Dawid
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> Languagetool-devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>
>>
>
