Michel Weimerskirch writes:
Hi

I have been playing around with the morphological analysis features of
hunspell(*). Has anyone investigated if and how the PoS data
associated with a hunspell wordlist could be used for a grammar checker?

Yes and no ;)

I mean I've built a Polish tagger using a hunspell wordlist, but the current hunspell features were too limited for Polish: it turned out that I needed to take all of a word's affix flags into account together, while hunspell treats each flag atomically. I could have rewritten the whole affix file from scratch, but I didn't feel up to it, so I ended up with a different algorithm that is a little more "holistic". If you read AWK code, you can look it up at:

http://morfologik.svn.sourceforge.net/viewvc/morfologik/scripts/

Start from the Makefile, as the code is not really heavily commented ;)
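
To illustrate the difference between treating flags atomically and looking at the whole flag set, here is a minimal Python sketch. It is not the algorithm from the AWK scripts; the flag letters, the tag names, and both lookup tables are made up for illustration. The only thing taken from hunspell is the "word/FLAGS" layout of a .dic entry.

```python
# All flag letters and tags below are hypothetical, for illustration only.
PER_FLAG = {"A": "noun", "B": "adj"}  # atomic view: each flag maps to a tag on its own

PER_FLAG_SET = {  # "holistic" view: the complete set of flags selects the tag
    frozenset("A"): "noun:inanimate",
    frozenset("AB"): "noun:animate",
}


def parse_entry(line):
    """Split a 'word/FLAGS' dictionary entry into the word and its flag set."""
    word, _, flags = line.partition("/")
    return word, frozenset(flags)


def tag_atomic(flags):
    """Look each flag up independently of the others."""
    return sorted(PER_FLAG[f] for f in flags if f in PER_FLAG)


def tag_holistic(flags):
    """Look the whole flag set up at once; None if the combination is unknown."""
    return PER_FLAG_SET.get(flags)


word, flags = parse_entry("kot/AB")
print(tag_atomic(flags))    # one tag per flag, no interaction between flags
print(tag_holistic(flags))  # one tag chosen by the combination of flags
```

In the atomic view the entry gets two unrelated tags, while in the set-based view the combination "AB" can mean something that neither flag means alone.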

This is how this "could" work in theory (very rough sketch ;-) ):
- Grammar checker gets a paragraph that is to be checked
- Paragraph is split into sentences
- Sentences are split into tokens
- The tokens are tagged with lexical categories (data from hunspell wordlist?)
- Grammar rules are applied
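
The steps above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the lexicon, the tag names, and the single rule are invented for the example, and the sentence splitter and tokenizer are deliberately naive. A real checker would take its tags from a dictionary built from the wordlist.

```python
import re

# Hypothetical toy lexicon: word form -> set of possible PoS tags.
LEXICON = {
    "a": {"DET"},
    "an": {"DET"},
    "apple": {"NOUN"},
    "cat": {"NOUN"},
    "eats": {"VERB"},
    "the": {"DET"},
}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)


def tag(tokens):
    """Attach the set of possible tags to each token (empty set if unknown)."""
    return [(t, LEXICON.get(t.lower(), set())) for t in tokens]


def rule_a_before_vowel(tagged):
    """Toy rule: report 'a' directly followed by a noun starting with a vowel."""
    errors = []
    for (word, _), (nxt, nxt_tags) in zip(tagged, tagged[1:]):
        if word.lower() == "a" and "NOUN" in nxt_tags and nxt[0].lower() in "aeiou":
            errors.append(f"use 'an' before '{nxt}'")
    return errors


def check(paragraph):
    """Run the whole pipeline: sentences -> tokens -> tags -> rules."""
    errors = []
    for sentence in split_sentences(paragraph):
        errors.extend(rule_a_before_vowel(tag(tokenize(sentence))))
    return errors


print(check("The cat eats a apple. The cat eats an apple."))
```

The rule only sees tokens and their tags, so swapping the toy lexicon for a real tagged dictionary would not change the rest of the pipeline.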

Any insight or comments?

The resulting dictionary (created offline because of the high memory/time cost) is about 200MB for Polish. It is then encoded as a finite state automaton ready for use in LanguageTool, which brings it down to a bearable 2.7MB file; the hunspell source for Polish is about 4.5MB, which is significantly more. The code for reading fsa files (morfologik in Java or morph_fsa in C++) is definitely faster than any other text lookup thanks to the finite state encoding.
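
As a rough illustration of why an automaton-style encoding helps, here is a small Python sketch of a trie, which shares common prefixes between word forms and answers a lookup in time proportional to the word length. It is not the morfologik or morph_fsa file format, and it does not merge common suffixes the way a minimized automaton would; the tag strings are placeholders chosen for the example.

```python
class Trie:
    """Prefix-sharing automaton; final states store (lemma, tag) pairs."""

    def __init__(self):
        self.children = {}  # character -> Trie
        self.analyses = []  # non-empty only on states that end a word form

    def add(self, form, lemma, tag):
        """Insert one analysed word form, reusing existing prefix states."""
        node = self
        for ch in form:
            node = node.children.setdefault(ch, Trie())
        node.analyses.append((lemma, tag))

    def lookup(self, form):
        """Follow one transition per character; return all stored analyses."""
        node = self
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.analyses)

    def count_states(self):
        """Number of states, including this one."""
        return 1 + sum(c.count_states() for c in self.children.values())


lexicon = Trie()
for form, lemma, tag in [
    ("kot", "kot", "subst:sg:nom"),   # placeholder tags, not a real tagset dump
    ("kota", "kot", "subst:sg:gen"),
    ("kota", "kot", "subst:sg:acc"),
    ("koty", "kot", "subst:pl:nom"),
]:
    lexicon.add(form, lemma, tag)

print(lexicon.lookup("kota"))
print(lexicon.count_states())
```

The four entries share the states for "kot", so the structure grows with the number of distinct prefixes, not with the total number of characters in the list.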

So something like that is actually being used in a real-world app, but it turned out that I either needed to use more data than hunspell currently allows, or to create a new affix file to accommodate this. I don't know whether the latter is a good approach; I never tried it. If you're skilled in writing affix files, this might be a better idea, especially because hunspell supports UTF-8 flags, which gives a lot of flexibility.

I'm still planning to start a major rewrite of the affix flag / tagging rules, as the Polish hunspell source has been significantly cleaned up (it contained many duplicates, that is, different flags creating the same PoS tag and the same affix). The current dictionary is imperfect, especially for the accusative case.

Regards,
Marcin
