2012-11-30, Marcin Miłkowski wrote:
> The idea is not new and there are some prototypes for other
> finite-state machine software (including direct conversion from
> hunspell to FSM formalism, lexc / twolc). Tommy Pirinen has written
> some papers that describe the process, and the software, but the
> software first converts to lexc / twolc,
True, unfortunately I haven't had time to write a proper version, but if someone is willing to do one I can probably help whenever I have time. There's also one I heard about from Andras Kornai himself here <http://www.wpi.edu/Pubs/E-project/Available/E-project-042810-055257/unrestricted/kgreenfield_sjudd_MQP.pdf>, but I never had the time to check it out properly; maybe it's more useful to you.

> which is not directly
> translatable to fsa (they use hfst, whose Java version is not really
> finished).

The Java version for automaton traversal should work, but I don't use it at all so I cannot say for sure. And it indeed does not include other functionality (e.g. automata algebra), and that is not planned at the moment, since none of us use Java for anything. Admittedly, the last time I seriously touched Java it was 1.4.

> In the minimal scenario, the only thing that remains to be
> done is to create a lexc 2 fsa converter. This is not rocket science
> but requires some work - there were people who did it:
>
> yeda.cs.technion.ac.il/~yona/talks/xfst2fsa/xfst2fsa.ps

The lexc part is a one-night programming exercise for any competent programmer (a disjunction of strings or a suffix tree, plus a few extra arcs here and there), but the twolc part might take quite a lot of time. Twolc is not really an ideal formalism for describing hunspell's context restrictions and deletions, but it was the one I had available at the time; rewriting it any other way would probably take some time and a bit of thinking too. If you only need the final automaton (that is, you don't need to compile aff/dic files on the fly), though, you should be able to use any of the open source fst applications to compile it, dump the raw graph into some simple format such as AT&T's tsv, and read from that.

> Alternatively, one could try developing a direct hunspell parser that
> creates a graph by using the .aff file. This would be a bit cleaner
> because the conversion to lexc / twolc is only a prototype (and hard
> to compile).
Indeed, you will be lucky if you get it working; for reasons I cannot remember, I wrote that in kludgy flex and yacc back then. The .aff parsing is not the hard part, though: it's just a line-based format with neat, space-separated fields, and should be trivial in any programming language. I suspect that in implementing this, the only part that requires a bit of work is the context restrictions and deletions (and their regexes), especially if you do not have a system for that yet.

A generic hunspell (~ hunmorph) to fst converter would be a good tool for a lot of projects. If it gets implemented, I'd be interested to see it in a form that can be re-used in e.g. hfst's tools, which basically can be done with as little as having the possibility to dump the graph in some easily parsable format.

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
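
As a rough illustration of the direct route discussed in this thread (read the .aff rules, apply them to the .dic entries, and dump the resulting graph in an AT&T-style tab-separated text format), here is a minimal Python sketch. It is not the existing flex/yacc prototype and not hfst code. All function names (parse_aff, expand, to_att, accepts) and the sample data are hypothetical. It assumes suffix (SFX) rules only, single-character flags, no cross products or continuation classes, and it treats hunspell conditions as Python regexes anchored at the end of the word, which is only an approximation. The graph is a plain unminimised trie, one simple way to build the disjunction of strings.

```python
import re
from collections import defaultdict

def parse_aff(text):
    """Collect SFX rules per flag; header lines and other directives are skipped."""
    rules = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Rule lines look like: SFX flag strip add condition
        # (header lines, e.g. "SFX S Y 2", have only four fields).
        if len(fields) < 5 or fields[0] != "SFX":
            continue
        _, flag, strip, add, cond = fields[:5]
        strip = "" if strip == "0" else strip   # "0" means "strip nothing"
        add = "" if add == "0" else add         # "0" means "add nothing"
        # Assumption: the condition is matched as a regex at the end of the stem.
        rules[flag].append((strip, add, re.compile("(?:" + cond + ")$")))
    return rules

def expand(dic_text, rules):
    """Return all surface forms licensed by the dictionary and the suffix rules."""
    words = set()
    for line in dic_text.splitlines()[1:]:      # first line is the entry count
        entry = line.strip()
        if not entry:
            continue
        stem, _, flags = entry.partition("/")
        words.add(stem)
        for flag in flags:                      # assumes single-character flags
            for strip, add, cond in rules.get(flag, []):
                if stem.endswith(strip) and cond.search(stem):
                    words.add(stem[:len(stem) - len(strip)] + add)
    return words

def to_att(words):
    """Build a trie over the words and return it as AT&T-style text lines."""
    arcs = {}           # (source state, symbol) -> target state
    finals = set()
    next_state = 1      # state 0 is the start state
    for word in sorted(words):
        state = 0
        for ch in word:
            if (state, ch) not in arcs:
                arcs[(state, ch)] = next_state
                next_state += 1
            state = arcs[(state, ch)]
        finals.add(state)
    # Arc lines: source, target, input symbol, output symbol (identical here).
    lines = [f"{src}\t{dst}\t{ch}\t{ch}" for (src, ch), dst in arcs.items()]
    # A line with a single field marks a final state.
    lines += [str(state) for state in sorted(finals)]
    return lines

def accepts(att_lines, word):
    """Read the dumped graph back and check whether it accepts the word."""
    arcs, finals = {}, set()
    for line in att_lines:
        fields = line.split("\t")
        if len(fields) == 1:
            finals.add(int(fields[0]))
        else:
            arcs[(int(fields[0]), fields[2])] = int(fields[1])
    state = 0
    for ch in word:
        state = arcs.get((state, ch))
        if state is None:
            return False
    return state in finals

# Made-up sample data, loosely modelled on an English plural rule.
AFF = """SFX S Y 2
SFX S y ies [^aeiou]y
SFX S 0 s [^y]
"""
DIC = """2
pony/S
cat/S
"""

words = expand(DIC, parse_aff(AFF))
att = to_att(words)
print("\n".join(att))
print(accepts(att, "ponies"), accepts(att, "ponys"))
```

Expanding every form up front only works for small lexicons and sidesteps the hard part named above: representing context restrictions and deletions inside the automaton itself. The dump step is the piece that matters for re-use, since any tool that can read such tab-separated arc lists could pick the graph up from there.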