Michel Weimerskirch wrote:
Hi
I have been playing around with the morphological analysis features of
hunspell(*). Has anyone investigated whether and how the PoS data
associated with a hunspell wordlist could be used for a grammar checker?
Yes and no ;)
I mean I've built a Polish tagger using a hunspell wordlist, but the
current hunspell features were too limited for Polish: it turned out
that I needed to take all affix flags into account to make it work,
while hunspell treats flags atomically. I could have rewritten the whole
affix file from scratch, but I didn't exactly feel up to it, so I ended
up with a different algorithm that is a little bit more "holistic". If
you can read AWK, the code is at:
http://morfologik.svn.sourceforge.net/viewvc/morfologik/scripts/
Start from the Makefile, as the code is not really heavily commented ;)
This is how this "could" work in theory (very rough sketch ;-) ):
- Grammar checker gets a paragraph that is to be checked
- Paragraph is split into sentences
- Sentences are split into tokens
- The tokens are tagged with lexical categories (data from hunspell wordlist?)
- Grammar rules are applied
Any insight or comments?
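The steps in the sketch above can be illustrated with a few lines of
Python. This is only a toy under stated assumptions: the LEXICON table,
the tag names and the single agreement rule are all made up for the
example, and a real setup would derive the lexicon offline from the
hunspell wordlist rather than hard-code it.

```python
import re

# Hypothetical toy lexicon: surface form -> set of (lemma, tag) readings.
# The entries and tag names are invented for illustration only.
LEXICON = {
    "the": {("the", "DET")},
    "dog": {("dog", "NOUN")},
    "dogs": {("dog", "NOUN:pl")},
    "bark": {("bark", "VERB"), ("bark", "NOUN")},
    "barks": {("bark", "VERB:3sg"), ("bark", "NOUN:pl")},
}


def split_sentences(paragraph):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def tokenize(sentence):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    # Every token gets all of its lexicon readings (no disambiguation).
    return [(t, sorted(LEXICON.get(t.lower(), set()))) for t in tokens]


def check_agreement(tagged):
    # Toy rule: an unambiguous plural noun directly followed by a token
    # that has a 3rd person singular verb reading is reported.
    errors = []
    for (tok_a, readings_a), (tok_b, readings_b) in zip(tagged, tagged[1:]):
        tags_a = {t for _, t in readings_a}
        tags_b = {t for _, t in readings_b}
        if tags_a == {"NOUN:pl"} and "VERB:3sg" in tags_b:
            errors.append((tok_a, tok_b))
    return errors


def check_paragraph(paragraph):
    # paragraph -> sentences -> tokens -> tags -> rules
    errors = []
    for sentence in split_sentences(paragraph):
        errors.extend(check_agreement(tag(tokenize(sentence))))
    return errors


if __name__ == "__main__":
    print(check_paragraph("The dog barks. The dogs barks."))
```

Note that "barks" keeps both of its readings; the rule only fires
because "dogs" is unambiguous. With a real lexicon most tokens are
ambiguous, so the rules have to cope with sets of readings.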
The resulting dictionary (built offline because of the high memory and
time cost) is about 200MB for Polish. It is then encoded as a finite
state automaton ready for use in LanguageTool, which brings it down to a
bearable 2.7MB file, while the hunspell source for Polish is about
4.5MB, which is significantly more. The code for reading fsa files
(morfologik in Java or morph_fsa in C++) is definitely faster than any
other text lookup thanks to the finite state encoding.
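To give an idea of why a finite state encoding shrinks a list of
inflected forms so much, here is a minimal Python sketch. It is not the
morfologik or morph_fsa format or algorithm, just a generic illustration:
it builds a plain trie and then merges states that accept the same set
of suffixes. All names in it are made up for the example.

```python
class Node:
    def __init__(self):
        self.edges = {}     # character -> child Node
        self.final = False  # True if a word ends in this state


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    # Post-order pass: canonicalize the children first, then look up
    # this state by its signature (finality plus outgoing edges).
    # States with equal signatures accept the same suffixes and are
    # shared. The registry keeps canonical nodes alive, so id() is a
    # stable key for them.
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    key = (node.final,
           tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
    return registry.setdefault(key, node)


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


if __name__ == "__main__":
    words = ["walk", "walked", "walking", "talk", "talked", "talking"]
    print(count_states(build_trie(words)))                # 19 trie states
    print(count_states(minimize(build_trie(words), {})))  # 9 shared states
```

Lookup stays a simple walk over the characters of the word, one
transition per character, whether or not the states are shared.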
So something like that is actually being used in a real-world app, but
it turned out that I either needed to use more data than hunspell
currently allows, or to create a new affix file to accommodate this.
I don't know if the latter is a good approach, I never tried it. If
you're skilled in affix file writing, this might be a better idea,
especially because hunspell supports UTF-8 flags and that gives a lot of
flexibility.
I'm still planning to start a major rewrite of the affix flag / tagging
rules, as the Polish hunspell source has been significantly cleaned up
(it contained many duplicates in terms of flags creating the same PoS
tag and the same affix) - the current dictionary is imperfect,
especially for the accusative case.
Regards,
Marcin