Re: [lingu-dev] Hunspell morphological analysis and grammar checker

Marcin Miłkowski Sun, 25 May 2008 08:40:57 -0700

Michel Weimerskirch wrote:

Hi


I have been playing around with the morphological analysis features of
hunspell(*). Has anyone investigated if and how the PoS data
associated to a hunspell wordlist could be used for a grammar checker?


Yes and no ;)

I mean I've built a Polish tagger using a hunspell wordlist, but current hunspell features were too limited for Polish: actually, it turned out that I needed to take into account all affix flags to make it work, while hunspell treats flags atomically. I could rewrite the whole affix file from scratch but I didn't exactly feel up to it, so I ended up with a different algorithm that is a little bit more "holistic". If you read AWK code, you can look it up at:


http://morfologik.svn.sourceforge.net/viewvc/morfologik/scripts/

Start from the Makefile, as the code is not really heavily commented ;)
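
To make the "atomic" versus "holistic" distinction concrete, here is a minimal Python sketch. It is not the AWK code from the repository; the flag letters, the tags, and the helper names (parse_dic_entry, tag_atomic, tag_in_context) are all invented for illustration. The only point is that the tag produced for one flag may depend on which other flags the same stem carries.

```python
# Hypothetical flags and tags, invented for this example only.
# Atomic view: each flag maps to one tag, regardless of its neighbours.
ATOMIC_TAGS = {
    "A": "subst:sg:gen",
    "B": "subst:pl:nom",
}

# Holistic view: the tag for a flag can depend on the full flag set
# attached to the stem (here the co-occurring flags decide the gender).
CONTEXT_TAGS = {
    ("A", frozenset("AB")): "subst:sg:gen:m1",
    ("A", frozenset("A")): "subst:sg:gen:m3",
}


def parse_dic_entry(line):
    """Split a 'stem/FLAGS' wordlist line into (stem, set of flags).

    Assumes single-character flags, one per letter.
    """
    stem, _, flags = line.strip().partition("/")
    return stem, frozenset(flags)


def tag_atomic(flag):
    """Tag a form looking only at the flag that generated it."""
    return ATOMIC_TAGS.get(flag, "unknown")


def tag_in_context(flag, all_flags):
    """Tag a form using the generating flag plus all flags on the stem.

    Falls back to the atomic tag when no context-specific entry exists.
    """
    return CONTEXT_TAGS.get((flag, all_flags), tag_atomic(flag))


stem, flags = parse_dic_entry("dom/AB")
for flag in sorted(flags):
    print(stem, flag, tag_atomic(flag), tag_in_context(flag, flags))
```

With a table like this, the same flag "A" can yield different tags for stems with different flag sets, which a purely per-flag lookup cannot express.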

This is how this "could" work in theory (very rough sketch ;-) ):
- Grammar checker gets a paragraph that is to be checked
- Paragraph is split into sentences
- Sentences are split into tokens
- The tokens are tagged with lexical categories (data from hunspell wordlist?)
- Grammar rules are applied

Any insight or comments?
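
The steps in the sketch above can be written down as a small Python pipeline. This is only an illustration under simplifying assumptions: the lexicon is a tiny hand-made dictionary standing in for PoS data derived from a wordlist, the sentence splitter and tokenizer are naive regular expressions, and the single rule (two determiners in a row) is made up for the example. None of the names come from hunspell or LanguageTool.

```python
import re

# Toy lexicon standing in for PoS data derived from a wordlist (assumption).
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "cat": {"NOUN"},
    "sleeps": {"VERB"},
}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    """Attach the set of possible tags to each token."""
    return [(tok, LEXICON.get(tok.lower(), {"UNKNOWN"})) for tok in tokens]


def rule_repeated_determiner(tagged):
    """Example rule: flag two adjacent tokens that can both be determiners."""
    messages = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if "DET" in t1 and "DET" in t2:
            messages.append(f"two determiners in a row: '{w1} {w2}'")
    return messages


def check(paragraph):
    """Run the whole pipeline: sentences -> tokens -> tags -> rules."""
    messages = []
    for sentence in split_sentences(paragraph):
        messages.extend(rule_repeated_determiner(tag(tokenize(sentence))))
    return messages


print(check("The a cat sleeps. The cat sleeps."))
```

Keeping a set of tags per token, rather than a single tag, leaves room for ambiguous words; disambiguation would be a further step that this sketch leaves out.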

The resulting dictionary (created offline due to high memory/time cost) is about 200MB for Polish, then it's encoded as a finite state automaton ready for use in LanguageTool (and this makes it into a bearable 2.7MB file, while hunspell source for Polish is about 4.5MB which is significantly more). The code for reading fsa files (morfologik in Java or morph_fsa in C++) is definitely faster than any other text lookup thanks to the finite state encoding.
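
As a rough illustration of why an automaton-style encoding is compact and fast to query, here is a plain trie in Python. It is a simplification and not the morfologik or morph_fsa file format: a trie only shares common prefixes, whereas a minimized finite state automaton also shares common suffixes. The word forms and tags are illustrative, and the names TrieNode, build_trie, lookup and count_nodes are made up for this sketch.

```python
class TrieNode:
    """One state: outgoing transitions by character, plus an optional tag."""

    __slots__ = ("children", "tags")

    def __init__(self):
        self.children = {}
        self.tags = None


def build_trie(entries):
    """Build a trie from a {word form: tag} mapping."""
    root = TrieNode()
    for form, tags in entries.items():
        node = root
        for ch in form:
            node = node.children.setdefault(ch, TrieNode())
        node.tags = tags
    return root


def lookup(root, form):
    """Follow one transition per character; cost depends on len(form) only."""
    node = root
    for ch in form:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.tags


def count_nodes(node):
    """Count states, to compare against the raw character count."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Illustrative entries; the tag strings are assumptions, not real data.
entries = {
    "dom": "subst:sg:nom",
    "domu": "subst:sg:gen",
    "domy": "subst:pl:nom",
}
root = build_trie(entries)
print(lookup(root, "domu"))
print(count_nodes(root), "states for", sum(len(f) for f in entries), "characters")
```

A lookup walks one transition per character of the query, so its cost does not grow with the number of entries in the dictionary.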

So, actually something like that is being used in a real-world app but it turned out that I either needed to use more data than hunspell would currently allow, or create a new affix file to accommodate this. I don't know if the latter is a good approach, I never tried it. If you're skilled in affix file writing, this might be a better idea, especially because hunspell supports UTF-8 flags and that gives a lot of flexibility.

I'm still planning to start a major rewrite of affix flag / tagging rules as the Polish hunspell source has been significantly cleaned up (it contained many duplicates in terms of flags creating the same PoS tag and the same affix) - the current dictionary is imperfect, especially for accusative case.


Regards,
Marcin
