On 2012-11-29 23:49, Daniel Naber wrote:
> On 30.05.2012, 10:51:31 Marcin Miłkowski wrote:
>
> Hi Marcin,
>
>>> What about compounding and compounding rules?
>>
>> They are useful to create a wordlist, but the wordlist is anyway finite,
>> so it can be represented by the finite-state dictionary (actually, it
>> could also represent an infinite dictionary, but we have no simple way
>> to create such a binary automaton, i.e. it would have to be written).
>
> do you have any plans to work on that? I'm asking because I have a use case
> that requires fast suggestions for typos in German compounds.

Plans: yes, but definitely not for this release. I have several projects that I have to finish this year, and they have taken a substantial amount of my time, so I hope to be more available for LT next year.

The idea is not new, and there are some prototypes for other finite-state machine software (including direct conversion from hunspell to an FSM formalism, lexc / twolc). Tommy Pirinen has written some papers that describe the process and the software, but the software first converts to lexc / twolc, which is not directly translatable to fsa (they use hfst, whose Java version is not really finished).

In the minimal scenario, the only thing that remains to be done is to create a lexc-to-fsa converter. This is not rocket science but requires some work; there were people who did it:

yeda.cs.technion.ac.il/~yona/talks/xfst2fsa/xfst2fsa.ps

Now, the main problem is that neither fsa nor morfologik-stemming accepts any formalism as input. We would need to provide some kind of graph representation to it, and then use the usual routines to optimize the automaton (= the graph) to make it smaller and faster. So what we need is to represent lexc / twolc files as graphs and feed them to morfologik-stemming. I think there is some code that does this for the hfst package, but it's in C++, so it has to be ported to Java and customized for morfologik-stemming.

Alternatively, one could try developing a direct hunspell parser that creates a graph by using the .aff file. This would be a bit cleaner because the conversion to lexc / twolc is only a prototype (and hard to compile).

If anyone wants to start working on this, I'd be delighted.

Best,
Marcin
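
A rough illustration of the point that a finite graph can stand for an unbounded set of compounds. This is only a sketch under simplifying assumptions: it is not fsa or morfologik-stemming code, the names build_trie and accepts are made up for the example, and it allows any dictionary word to follow any other, which is far looser than real hunspell compounding rules.

```python
def build_trie(words):
    """Build a trie as a graph: state id -> {character: next state id}."""
    arcs = {0: {}}   # state 0 is the start state
    final = set()    # states in which a complete word ends
    for word in words:
        state = 0
        for ch in word:
            if ch not in arcs[state]:
                new_state = len(arcs)
                arcs[new_state] = {}
                arcs[state][ch] = new_state
            state = arcs[state][ch]
        final.add(state)
    return arcs, final


def accepts(arcs, final, word, compounding=True):
    """Simulate the graph on `word`.

    With compounding enabled, reaching a final state also allows a jump
    back to the start state, so another word part may follow. The graph
    stays finite, but the accepted language becomes infinite.
    """
    states = {0}
    for ch in word:
        next_states = set()
        for s in states:
            t = arcs[s].get(ch)
            if t is not None:
                next_states.add(t)
                if compounding and t in final:
                    next_states.add(0)  # loop back: start the next part
        states = next_states
        if not states:
            return False
    return any(s in final for s in states)


arcs, final = build_trie(["haus", "boot"])
print(accepts(arcs, final, "hausboot"))
print(accepts(arcs, final, "hausboot", compounding=False))
```

The loop-back step is what a plain word list cannot express. A real converter would have to derive such arcs from the compounding flags in the .aff file and then hand the resulting graph to the usual minimization routines.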
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel