Re: [Languagetool] hunspell vs. fsa spell

Marcin Miłkowski Fri, 30 Nov 2012 02:57:58 -0800

On 2012-11-29 23:49, Daniel Naber wrote:
> On 30.05.2012, 10:51:31 Marcin Miłkowski wrote:
>
> Hi Marcin,
>
>>> What about compounding and compounding rules?
>>
>> They are useful to create a wordlist, but the wordlist is finite anyway,
>> so it can be represented by a finite-state dictionary (actually, it
>> could also represent an infinite dictionary, but we have no simple way
>> to create such a binary automaton, i.e. it would have to be written).
>
> do you have any plans to work on that? I'm asking because I have a use case
> that requires fast suggestions for typos in German compounds.


Plans - yes, but definitely not for this release. I have several projects 
that I have to finish this year, and they have taken a substantial amount 
of my time, so I hope to be more available for LT next year.

The idea is not new, and there are some prototypes for other finite-state 
machine software (including direct conversion from hunspell to an FSM 
formalism, lexc / twolc). Tommy Pirinen has written some papers that 
describe the process and the software, but the software first converts 
to lexc / twolc, which is not directly translatable to fsa (they use 
hfst, whose Java version is not really finished). In the minimal 
scenario, the only thing that remains to be done is to create a 
lexc-to-fsa converter. This is not rocket science but requires some 
work - there were people who did it:

yeda.cs.technion.ac.il/~yona/talks/xfst2fsa/xfst2fsa.ps

Now, the main problem is that neither fsa nor morfologik-stemming accepts 
any formalism as its input - we would need to provide some kind of 
graph representation to them, and then use the usual routines to optimize 
the automaton (= the graph) to make it smaller and faster.
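
To make the "graph, then optimize" step concrete, here is a minimal sketch in plain Java. It does not use the morfologik-stemming API; the class and method names (WordGraph, build, minimize, accepts) are made up for illustration. It builds the graph as a trie and then merges equivalent states bottom-up, which is one simple way to shrink an acyclic automaton:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the morfologik-stemming API.
final class WordGraph {
    /** A state: labelled outgoing arcs plus a final-state flag. */
    static final class State {
        final TreeMap<Character, State> arcs = new TreeMap<>();
        boolean isFinal;
        int id = -1; // assigned once the state becomes canonical
    }

    /** Build the unoptimized graph (a trie) from a word list. */
    static State build(List<String> words) {
        State root = new State();
        for (String w : words) {
            State s = root;
            for (char c : w.toCharArray()) {
                s = s.arcs.computeIfAbsent(c, k -> new State());
            }
            s.isFinal = true;
        }
        return root;
    }

    /**
     * Merge equivalent states bottom-up. Two states are equivalent when
     * they agree on finality and on (label, canonical target) for all arcs.
     * The register ends up holding one entry per state of the result.
     */
    static State minimize(State s, Map<String, State> register) {
        StringBuilder key = new StringBuilder(s.isFinal ? "F" : "N");
        for (Map.Entry<Character, State> arc : s.arcs.entrySet()) {
            State target = minimize(arc.getValue(), register);
            arc.setValue(target); // redirect the arc to the canonical state
            key.append(arc.getKey()).append(target.id).append(';');
        }
        State canonical = register.get(key.toString());
        if (canonical == null) {
            s.id = register.size();
            register.put(key.toString(), s);
            canonical = s;
        }
        return canonical;
    }

    /** Follow the arcs for each character; accept only in a final state. */
    static boolean accepts(State root, String word) {
        State s = root;
        for (char c : word.toCharArray()) {
            s = s.arcs.get(c);
            if (s == null) return false;
        }
        return s.isFinal;
    }

    public static void main(String[] args) {
        State trie = build(List.of("tap", "taps", "top", "tops"));
        Map<String, State> register = new HashMap<>();
        State fsa = minimize(trie, register);
        System.out.println(register.size() + " states, tops=" + accepts(fsa, "tops"));
    }
}
```

For the four words in main, the trie has 8 states and the merged graph has 5, because "tap"/"top" and "taps"/"tops" share their suffix states. A real converter would hand the graph to the library's own builder and serializer rather than roll its own.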

So what we need is to represent lexc / twolc files as graphs and feed 
them to morfologik-stemming. I think there is some code that does this 
for the hfst package, but it's in C++, so it has to be ported to Java and 
customized for morfologik-stemming.

Alternatively, one could try developing a direct hunspell parser that 
creates a graph by using the .aff file. This would be a bit cleaner 
because the conversion to lexc / twolc is only a prototype (and hard to 
compile).
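
As a rough idea of the first step of such a parser, here is a sketch in Java that reads individual SFX rule lines from an .aff file and applies them to a .dic entry. The names (SuffixRule, parse, expand) are hypothetical, and it makes several simplifying assumptions: single-character flags, no prefixes, no cross-products, no compounding flags, and conditions simple enough to be treated as Java regular expressions. It only expands to a word list; building the graph directly from the rules (the interesting part for compounds) is not shown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; covers a small subset of the .aff format.
final class SuffixRule {
    final char flag;         // flag that enables the rule on a .dic entry
    final String strip;      // characters removed from the end of the stem
    final String add;        // characters appended after stripping
    final Pattern condition; // must match the end of the stem

    SuffixRule(char flag, String strip, String add, String condition) {
        this.flag = flag;
        this.strip = strip;
        this.add = add;
        // Assumption: the condition is a valid Java regex (e.g. "[^aeiou]y", ".").
        this.condition = Pattern.compile(condition + "$");
    }

    /** Parse a rule line such as "SFX S y ies [^aeiou]y" (not the 4-field header line). */
    static SuffixRule parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 5 || !f[0].equals("SFX")) {
            throw new IllegalArgumentException("not an SFX rule line: " + line);
        }
        String strip = f[2].equals("0") ? "" : f[2]; // "0" means "strip nothing"
        String add = f[3].equals("0") ? "" : f[3];   // "0" means "add nothing"
        return new SuffixRule(f[1].charAt(0), strip, add, f[4]);
    }

    boolean appliesTo(String stem) {
        return stem.endsWith(strip) && condition.matcher(stem).find();
    }

    String apply(String stem) {
        return stem.substring(0, stem.length() - strip.length()) + add;
    }

    /** Expand a .dic entry like "pony/S" into the stem plus its suffixed forms. */
    static List<String> expand(String dicEntry, List<SuffixRule> rules) {
        int slash = dicEntry.indexOf('/');
        String stem = slash < 0 ? dicEntry : dicEntry.substring(0, slash);
        String flags = slash < 0 ? "" : dicEntry.substring(slash + 1);
        List<String> forms = new ArrayList<>();
        forms.add(stem);
        for (SuffixRule r : rules) {
            if (flags.indexOf(r.flag) >= 0 && r.appliesTo(stem)) {
                forms.add(r.apply(stem));
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        List<SuffixRule> rules = List.of(
                parse("SFX S y ies [^aeiou]y"),
                parse("SFX S 0 s [aeiou]y"),
                parse("SFX S 0 s [^y]"));
        System.out.println(expand("pony/S", rules));
    }
}
```

With the three rules in main, "pony/S" yields pony and ponies, while "boy/S" yields boy and boys. For the graph version, each rule would become arcs attached to the stem states instead of extra strings.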

If anyone wants to start working on this, I'd be delighted.

Best,
Marcin
