Hi,

I have just committed a big patch that introduces an ngram rule (EnglishConfusionProbabilityRule). This rule uses a large ngram index to check how common a phrase is. If someone uses "their" but "there" is much more common in that context, we assume an error. The context is currently two words to the left and two words to the right.
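To illustrate the idea, here is a minimal toy sketch (not the actual EnglishConfusionProbabilityRule code): look up the occurrence count of the context phrase with the word that was used and with its confusion-pair alternative, and flag an error if the alternative is clearly more common. The in-memory map, the MIN_FACTOR threshold, and all method names are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of an ngram-based confusion rule: compare how common the
// context is with the word actually used vs. with its alternative.
public class ConfusionSketch {

    // Stand-in for the large ngram index: phrase -> occurrence count.
    private static final Map<String, Long> NGRAM_COUNTS = new HashMap<>();
    static {
        NGRAM_COUNTS.put("over there by the", 9000L);
        NGRAM_COUNTS.put("over their by the", 10L);
    }

    // Hypothetical threshold: the alternative must be this many times
    // more common before we assume an error.
    private static final long MIN_FACTOR = 100;

    static long count(String phrase) {
        return NGRAM_COUNTS.getOrDefault(phrase, 0L);
    }

    /**
     * Returns true if 'alternative' is so much more common than 'used'
     * in the given context (words to the left and right) that we assume
     * the writer meant the alternative.
     */
    static boolean looksLikeError(String left, String used,
                                  String alternative, String right) {
        long usedCount = count(left + " " + used + " " + right);
        long altCount  = count(left + " " + alternative + " " + right);
        return altCount > usedCount * MIN_FACTOR;
    }

    public static void main(String[] args) {
        // "their" used where "there" is far more common in this context:
        System.out.println(looksLikeError("over", "their", "there", "by the"));
        // The correct usage is not flagged:
        System.out.println(looksLikeError("over", "there", "their", "by the"));
    }
}
```

The real rule works on counts from the large ngram data rather than a hard-coded map, and uses tunable score thresholds instead of a single factor, but the decision structure is the same.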
The rule is only activated by the new --languagemodel argument, which takes the path to the ngram data. As the ngram data is large (3.8 GB compressed), this isn't useful for the common user, but we might activate it for our API if we can make it work really well.

While the patch is quite big, most of it is test and evaluation code. The actual rule is quite simple: it just looks up the occurrence counts of the context in the data and, if the alternative word is clearly more common, raises an error.

It's not easy to tell how well this works. I ran evaluations against a dyslexia corpus and optimized for precision and recall there, preferring good precision. But even then, precision is quite bad on text that is already quite good, such as Wikipedia. So we now have two values in ConfusionProbabilityRule (MIN_SCORE_DIFF, MIN_ALTERNATIVE_SCORE) which can be used to trade off between good recall and good precision.

If you want to play with this, see http://wiki.languagetool.org/finding-errors-using-big-data

Regards
Daniel

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel