Hi,

I have just committed a big patch that introduces an ngram rule 
(EnglishConfusionProbabilityRule). This rule uses a large ngram index to 
check how common a phrase is. If someone uses "their", but "there" is 
much more common in that context, we assume an error. The context is 
currently two words to the left and two words to the right.

The rule will only be activated by the new --languagemodel argument, 
which takes a path to the ngram data. As the ngram data is large (3.8 GB 
compressed), this isn't useful for the average user, but we might 
activate it for our API if we can make it work really well.
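For reference, an invocation might look like the following. Only the 
--languagemodel argument is confirmed above; the jar name, the -l flag, 
and the paths are assumptions and may differ in your setup:

```shell
# Hypothetical example: check a text file with the ngram rule enabled.
# /path/to/ngram-data is wherever you unpacked the ngram index.
java -jar languagetool-commandline.jar \
  -l en-US \
  --languagemodel /path/to/ngram-data \
  input.txt
```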

While the patch is quite big, most of it is test and evaluation code. 
The actual rule is quite simple: it just looks up the context occurrence 
counts in the data and raises an error if the alternative word is 
clearly more common. It's not easy to tell how well this works. I ran 
evaluations against a dyslexia corpus and optimized for precision and 
recall there, preferring good precision. But even then, precision is 
quite bad on text that is already quite good, like Wikipedia. So we now 
have two values in ConfusionProbabilityRule (MIN_SCORE_DIFF, 
MIN_ALTERNATIVE_SCORE) that can be used to trade off between good 
recall and good precision.
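To make the idea concrete, here is a minimal sketch of the comparison 
the rule performs. The class name, the toy in-memory counts, and the 
threshold values are all made up for illustration; only the general 
approach (compare context counts for the used word vs. its confusion 
partner, gated by two thresholds) reflects the rule described above:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not LanguageTool's actual API.
public class ConfusionSketch {

    // Toy occurrence counts standing in for the large ngram index.
    static final Map<String, Long> COUNTS = new HashMap<>();
    static {
        COUNTS.put("climbed over their the fence", 10L);
        COUNTS.put("climbed over there the fence", 5000L);
    }

    static long count(String ngram) {
        return COUNTS.getOrDefault(ngram, 0L);
    }

    // Assumed threshold values, analogous in spirit to MIN_SCORE_DIFF
    // and MIN_ALTERNATIVE_SCORE: raising them favors precision,
    // lowering them favors recall.
    static final double MIN_SCORE_DIFF = 10.0;
    static final long MIN_ALTERNATIVE_SCORE = 100;

    // Flag 'used' as a likely error if 'alternative' is clearly more
    // common in the same context (two words left, two words right).
    static boolean isLikelyError(String left, String used,
                                 String alternative, String right) {
        long usedCount = count(left + " " + used + " " + right);
        long altCount = count(left + " " + alternative + " " + right);
        return altCount >= MIN_ALTERNATIVE_SCORE
                && altCount > usedCount * MIN_SCORE_DIFF;
    }

    public static void main(String[] args) {
        // "their" in a context where "there" is far more common.
        System.out.println(
            isLikelyError("climbed over", "their", "there", "the fence"));
    }
}
```

The two-threshold gate is the interesting part: the absolute count check 
suppresses matches on sparse data, while the ratio check demands that 
the alternative clearly dominates rather than merely edging ahead.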

If you want to play with this, see 
http://wiki.languagetool.org/finding-errors-using-big-data

Regards
  Daniel


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
