So I've made frequency work for spelling suggestions but I have two questions:
1) Before I added frequency my spelling .dict was ~600k; with frequency
added (I added frequency to all 1.6 million words) it's now 1.8M. Maybe
that's OK, but I just wanted to double-check that I am doing everything right.

2) I would like to suggest adding an -o <outputFile> option to
*DictionaryBuilder, as writing the output to a random tmp file makes
scripting harder. I already have local changes working (using the apache.cli
library), so if nobody objects I will clean up the code and push it in
(the option will be optional, so the old way will still work).

Thanks
Andriy

2014-12-22 15:25 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
> Ok, I've debugged it a bit and I found a problem: I was assuming it
> expects the input dictionary in utf-8 (like the frequency list) but it
> wants it using my output encoding. I converted my input into cp1251
> and now I get the 1.6 million states I expect.
>
> Thanks,
> Andriy
>
> FSA properties
> --------------
> FSA implementation : morfologik.fsa.CFSA2
> Compiled with flags : [FLEXIBLE, STOPBIT, NEXTBIT]
> Number of arcs : 440481/440481
> Number of nodes : 150498
> Number of final states : 1623702
>
> 2014-12-22 13:20 GMT-05:00 Dawid Weiss <dawid.we...@gmail.com>:
>> It would be easier if you could set up a small github project and:
>>
>> 1) include a 100-line source file with your data;
>> 2) include the commands you're using to compress this data through the
>> various pipelines you mentioned.
>>
>> I suspect SpellDictionaryBuilder preprocesses the input somehow before
>> it passes it to the fsa; I'd have to take a look at what it does
>> myself to actually say what's wrong :)
>>
>> Dawid
>>
>>
>> On Mon, Dec 22, 2014 at 6:41 PM, Andriy Rysin <ary...@gmail.com> wrote:
>>> I just double-checked:
>>> if I encode my dictionary with fsa_build, both cfsa2 and fsa5
>>> give me ~1.6 million rows
>>> if I encode using org.languagetool.dev.SpellDictionaryBuilder with
>>> frequency information I get ~1 million rows
>>> if I encode using org.languagetool.dev.SpellDictionaryBuilder without
>>> frequency information I get 7k rows (!)
>>>
>>> Here is the head output for the ~1 million rows generated by
>>> SpellDictionaryBuilder:
>>>
>>> FSA properties
>>> --------------
>>> FSA implementation : morfologik.fsa.CFSA2
>>> Compiled with flags : [FLEXIBLE, STOPBIT, NEXTBIT]
>>> Number of arcs : 249742/249742
>>> Number of nodes : 126723
>>> Number of final states : 1013129
>>>
>>> Dictionary metadata
>>> -------------------
>>> fsa.dict.separator : 0x2b ('+')
>>> fsa.dict.encoding : cp1251
>>> fsa.dict.frequency-included : true
>>> fsa.dict.encoder : SUFFIX
>>>
>>> Thanks
>>> Andriy
>>>
>>> 2014-12-22 12:10 GMT-05:00 Dawid Weiss <dawid.we...@gmail.com>:
>>>> I honestly don't know -- I'm not into LT development, I know the FST
>>>> package. :) Try "head" instead of wc -l and see what the output is?
>>>>
>>>> Dawid
>>>>
>>>> On Mon, Dec 22, 2014 at 2:09 PM, Andriy Rysin <ary...@gmail.com> wrote:
>>>>> Thanks, so I ran this (f-uk-UA.dict is the file with frequencies):
>>>>> java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -x -d f-uk-UA.dict | wc -l
>>>>> 1013148
>>>>>
>>>>> while (uk-UA.dict is the old one without frequencies)
>>>>> java -jar ./mfl/morfologik-tools-*-standalone.jar fsa_dump -d uk-UA.dict | wc -l
>>>>> 1623646
>>>>>
>>>>> Is this still correct?
>>>>>
>>>>> Thanks
>>>>> Andriy
>>>>>
>>>>> 2014-12-22 2:41 GMT-05:00 Dawid Weiss <dawid.we...@gmail.com>:
>>>>>>> So it looks like I get fewer words in the output, or am I reading it
>>>>>>> wrong?
>>>>>>
>>>>>> You are reading it wrong, Andriy. A final state does not correspond to
>>>>>> a unique path through the automaton.
>>>>>>
>>>>>> AX
>>>>>> AY
>>>>>>
>>>>>> should have two final states (X, Y), whereas:
>>>>>>
>>>>>> AX1
>>>>>> AY1
>>>>>>
>>>>>> will only have one final state (1).
>>>>>>
>>>>>> All this is a bit more complicated by the fact that morfologik uses
>>>>>> final bits on arcs rather than actual states... I'd say don't go into
>>>>>> it unless you really have to. You can always dump the content of the
>>>>>> created automaton, it should just return the input used to create it
>>>>>> -- that's how you know the automaton is valid.
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> _______________________________________________
>>>>>> Languagetool-devel mailing list
>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
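
For anyone hitting the same encoding mismatch discussed in this thread (the builder wanting its input in the output encoding, cp1251, rather than utf-8), here is a minimal sketch of the conversion step using iconv. The file names and the one-word sample are made up for illustration, and the thread does not say which tool was actually used for the conversion, so treat this as one possible way to do it rather than the way it was done.

```shell
# Sketch only: file names and the sample word are hypothetical.
work=$(mktemp -d)

# A one-word UTF-8 input standing in for the real word list.
printf 'слово\n' > "$work/words.utf8.txt"

# Re-encode from UTF-8 to cp1251 (the fsa.dict.encoding shown in the thread).
iconv -f UTF-8 -t CP1251 "$work/words.utf8.txt" > "$work/words.cp1251.txt"

# Sanity check: converting back must reproduce the original bytes.
iconv -f CP1251 -t UTF-8 "$work/words.cp1251.txt" \
  | cmp -s - "$work/words.utf8.txt" \
  && echo "round trip OK"
```

Running the conversion as a separate step also gives an early warning: iconv should exit with an error if the input contains a character that has no cp1251 representation, which is easier to diagnose than an unexpectedly small automaton.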