2012-11-30, Marcin Miłkowski wrote:

> The idea is not new and there are some prototypes for other
> finite-state machine software (including direct conversion from
> hunspell to FSM formalism, lexc / twolc). Tommy Pirinen has written
> some papers that describe the process, and the software, but the
> software first converts to lexc / twolc,

True, unfortunately I haven't had time to write a proper
version, but if someone's willing to do one I can probably help
whenever I have time.

There's also one I heard about from Andras Kornai himself, here
<http://www.wpi.edu/Pubs/E-project/Available/E-project-042810-055257/unrestricted/kgreenfield_sjudd_MQP.pdf>,
but I never had the time to check it out properly; maybe it's more
useful to you.

> which is not directly
> translatable to fsa (they use hfst, whose Java version is not really
> finished). 

The Java version for automaton traversal should work, but I don't use
it at all so I cannot say for sure. And it indeed does not include
other functionality (e.g. automata algebra), and that is not planned
atm, since none of us use Java for anything; admittedly, the last time
I seriously touched Java it was at 1.4.

> In the minimal scenario, the only thing that remains to be
> done is to create a lexc 2 fsa converter. This is not rocket science
> but requires some work - there were people who did it:
> 
> yeda.cs.technion.ac.il/~yona/talks/xfst2fsa/xfst2fsa.ps

The lexc part is a one-night programming exercise for any competent
programmer (a disjunction of strings or a suffix tree, plus a few extra
arcs here and there), but the twolc part might take quite a lot of time.
Twolc is not really an ideal formalism for describing hunspell's context
restrictions and deletions, but it was the one I had available at the
time; rewriting it any other way would probably take some time and a
bit of thinking too.
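
To illustrate the "disjunction of strings" idea, here is a minimal
Python sketch that builds a trie-shaped acceptor from a plain word
list. It assumes bare words only (no lexc continuation classes, no
minimisation), and the names build_trie and accepts are made up for
this example; they are not from hfst or any other tool.

```python
def build_trie(words):
    """Build a trie acceptor. States are ints; state 0 is the start state."""
    transitions = [{}]  # transitions[state][symbol] -> target state
    finals = set()
    for word in words:
        state = 0
        for ch in word:
            nxt = transitions[state].get(ch)
            if nxt is None:
                # No arc for this symbol yet: allocate a fresh state.
                nxt = len(transitions)
                transitions.append({})
                transitions[state][ch] = nxt
            state = nxt
        finals.add(state)  # the state reached at the end of a word is final
    return transitions, finals


def accepts(transitions, finals, word):
    """Follow the arcs for each symbol; accept if we stop in a final state."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in finals


transitions, finals = build_trie(["cat", "cats", "car"])
print(accepts(transitions, finals, "cats"))
print(accepts(transitions, finals, "ca"))
```

Shared prefixes end up sharing states, which is the whole point; the
"few extra arcs" for continuation classes would be added on top of this.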

If you only need the final automaton (that is, you don't need to
compile aff/dic files on the fly), you should be able to use any of the
open source fst applications to compile it, dump the raw graph into
some simple format such as AT&T's tab-separated text format, and read
it back from there.
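
As a rough sketch of the reading side, the following Python assumes the
transducer flavour of the AT&T text format: arc lines are
"source TAB target TAB input TAB output" (an optional weight column is
ignored), lines with only a state (and optional weight) mark final
states, and the source state of the first line is the start state.
Epsilon arcs are not handled, and read_att and lookup are names
invented for this example.

```python
from collections import defaultdict


def read_att(lines):
    """Parse AT&T-style tab-separated lines into (start, arcs, finals)."""
    arcs = defaultdict(list)  # state -> [(input, output, target), ...]
    finals = set()
    start = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip empty lines
        if start is None:
            start = fields[0]  # assumption: first line starts at the start state
        if len(fields) >= 4:
            src, dst, isym, osym = fields[:4]  # any fifth field is a weight
            arcs[src].append((isym, osym, dst))
        else:
            finals.add(fields[0])  # "state" or "state TAB weight"
    return start, arcs, finals


def lookup(start, arcs, finals, symbols):
    """Return every output string for the input symbols (no epsilon arcs)."""
    results = []
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols):
            if state in finals:
                results.append(out)
            continue
        for isym, osym, dst in arcs.get(state, ()):
            if isym == symbols[pos]:
                stack.append((dst, pos + 1, out + osym))
    return results


att = ["0\t1\tc\tc", "1\t2\ta\ta", "2\t3\tt\tt", "3\t4\ts\t+PL", "3", "4"]
start, arcs, finals = read_att(att)
print(lookup(start, arcs, finals, list("cats")))
```

Which symbol a given tool prints for epsilon differs between tools, so
that part would need checking against the actual dump.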

> Alternatively, one could try developing a direct hunspell parser that 
> creates a graph by using the .aff file. This would be a bit cleaner 
> because the conversion to lexc / twolc is only a prototype (and hard
> to compile).

Indeed, you will be lucky if you get it working; for reasons I cannot
remember, I wrote that in kludgy flex and yacc back then. The .aff
parsing is not the hard part though: it's just a line-based format with
neat, space-separated fields, and should be trivial in any programming
language. I suspect that the only part that requires a bit of work is
implementing the context restrictions and deletions (and their
regexes), especially if you do not yet have a system for that.
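
To make the "line-based, space-separated" point concrete, here is a
small Python sketch that reads PFX/SFX rules and applies one suffix
rule to a stem. It assumes the common layout of a header line
("SFX flag Y/N count") followed by that many rule lines
("SFX flag strip add [condition]"), with "0" meaning "nothing", and it
treats the condition as an ordinary regex anchored at the end of the
stem. Continuation flags after "/", morphological fields and all other
.aff directives are ignored. Rule, parse_aff and apply_suffix are
hypothetical names for this example only.

```python
import re
from collections import namedtuple

Rule = namedtuple("Rule", "kind flag strip add condition")


def parse_aff(lines):
    """Collect PFX/SFX rules; every other directive is skipped."""
    rules = []
    remaining = 0  # rule lines still expected after the last header
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] not in ("PFX", "SFX"):
            continue
        if remaining == 0:
            remaining = int(fields[3])  # header: kind flag cross-product count
            continue
        remaining -= 1
        kind, flag, strip, add = fields[:4]
        condition = fields[4] if len(fields) > 4 else "."
        rules.append(Rule(kind, flag,
                          "" if strip == "0" else strip,
                          "" if add == "0" else add,
                          condition))
    return rules


def apply_suffix(rule, stem):
    """Return the suffixed form, or None if the rule does not apply."""
    if not re.search("(?:" + rule.condition + ")$", stem):
        return None  # context restriction not met
    if rule.strip and not stem.endswith(rule.strip):
        return None  # nothing to delete
    base = stem[:len(stem) - len(rule.strip)]  # the deletion
    return base + rule.add


aff = ["SET UTF-8", "SFX S Y 2", "SFX S y ies [^aeiou]y", "SFX S 0 s [^sxzhy]"]
rules = parse_aff(aff)
print(apply_suffix(rules[0], "pony"))
print(apply_suffix(rules[1], "cat"))
```

Doing the same thing inside the automaton, instead of on strings one
stem at a time, is where the real work is.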

A generic hunspell (~ hunmorph) to fst converter would be a good tool
for a lot of projects. If it gets implemented, I'd be interested to see
it in a form that can be re-used in e.g. hfst's tools, which basically
requires as little as the possibility to dump the graph in some
easily parsable format.

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
