Hi,

I've just committed a German rule for subject-verb agreement and I'm 
posting it here because it uses an approach that might be useful for 
other languages, too.

You can find the documentation at 
http://wiki.languagetool.org/german-agreement-check

The most interesting part is probably the chunker, i.e. the detection of 
phrases. I tried OpenNLP with its stochastic chunker and it worked quite 
well, but it finds only small chunks, not complex ones. For the agreement check, 
we need complex chunks like "das große Haus und der Garten": "das große 
Haus" is one chunk, "der Garten" is another chunk, together they are one 
complex chunk. So on top of OpenNLP, rules are needed to find these 
complex chunks. It turned out that once you use rules to detect complex 
chunks, you might as well try to replace the OpenNLP chunker completely 
with some more rules. This avoids LT getting larger by another 10MB (the 
size of the models used by OpenNLP).
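
To illustrate the idea of building complex chunks on top of simple ones, 
here is a minimal sketch. It is not the code in GermanChunker: the class 
name, the method name, and the representation of a chunk as a half-open 
token range {start, end} are all made up for this example, and it only 
handles the conjunction "und".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not LanguageTool's GermanChunker.
class ComplexChunkSketch {

    /**
     * Merges neighbouring simple chunks into one complex chunk when exactly
     * one token, the conjunction "und", sits between them.
     * Each chunk is a half-open token range {start, end}; the input list is
     * assumed to be sorted by position and non-overlapping.
     */
    static List<int[]> mergeCoordinated(List<String> tokens, List<int[]> simpleChunks) {
        List<int[]> result = new ArrayList<>();
        for (int[] next : simpleChunks) {
            if (!result.isEmpty()) {
                int[] prev = result.get(result.size() - 1);
                boolean oneTokenGap = next[0] == prev[1] + 1;
                if (oneTokenGap && tokens.get(prev[1]).equalsIgnoreCase("und")) {
                    // Extend the previous chunk so it also covers "und" and the next chunk.
                    result.set(result.size() - 1, new int[] {prev[0], next[1]});
                    continue;
                }
            }
            result.add(next);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("das", "große", "Haus", "und", "der", "Garten");
        // Simple chunks as a base chunker might report them: "das große Haus", "der Garten".
        List<int[]> simple = Arrays.asList(new int[] {0, 3}, new int[] {4, 6});
        for (int[] chunk : mergeCoordinated(tokens, simple)) {
            System.out.println(String.join(" ", tokens.subList(chunk[0], chunk[1])));
        }
    }
}
```

With the two simple chunks from the example above, the sketch reports a 
single chunk spanning the whole phrase "das große Haus und der Garten". 
Because merging extends the last chunk in the result, longer chains such 
as "A und B und C" are also collapsed into one chunk.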

The rules are expressed in OpenRegex syntax, which is similar to what LT 
does in its patterns, but much more compact. You can look at some 
patterns here:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/de/src/main/java/org/languagetool/chunking/GermanChunker.java#L94

Unlike LT's pattern syntax, this is a real regular expression syntax, 
i.e. you can use operators like *, +, and ? with the semantics from regular 
expressions and you can nest expressions with parentheses. Currently, 
this is a dependency only for German, but if you want to use this in 
your language to detect chunks or for something else, we could move it 
to core.
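
To give a feeling for what "regular expressions over tokens" means, here 
is a small sketch that does not use OpenRegex at all. Instead it joins the 
part-of-speech tags of a sentence into one string and runs plain 
java.util.regex over it. The tag names (ART, ADJ, NN, KON, VER) are 
simplified, hypothetical labels rather than LT's real German tags, and 
the class and method names are invented for this example.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: emulates token-level regexes with java.util.regex.
class TagRegexSketch {

    // One simple noun phrase: article, any number of adjectives, noun.
    // \b makes sure the match starts at the beginning of a tag.
    private static final String NP = "\\bART (?:ADJ )*NN ";

    // A complex phrase: one NP followed by zero or more "conjunction + NP" groups.
    private static final Pattern COMPLEX_NP =
            Pattern.compile(NP + "(?:KON " + NP + ")*");

    /** Returns the tag sequence of the first complex noun phrase, or "" if none is found. */
    static String firstComplexNp(List<String> tags) {
        // Append a space after every tag so each tag ends with the same delimiter.
        StringBuilder joined = new StringBuilder();
        for (String tag : tags) {
            joined.append(tag).append(' ');
        }
        Matcher m = COMPLEX_NP.matcher(joined);
        return m.find() ? m.group().trim() : "";
    }

    public static void main(String[] args) {
        // Tags for a sentence shaped like "das große Haus und der Garten sind schön".
        List<String> tags = List.of("ART", "ADJ", "NN", "KON", "ART", "NN", "VER", "ADJ");
        System.out.println(firstComplexNp(tags));
    }
}
```

The operators * and the nested groups behave exactly as in character-level 
regular expressions, only the "alphabet" consists of token tags. A 
token-level regex library can additionally match on the surface form or 
other token properties, which this string-based trick cannot do cleanly.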

Regards
  Daniel


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
