Hi,

a rule that uses ngram occurrence data to detect errors in English text
has now been activated on languagetool.org. Here are some errors it can
detect which LT couldn't detect before:
  I can't remember how to go their.
  I didn't now where it came from.
  Alabama has for of the world's largest stadiums.

The word pairs supported are listed in confusion_sets.txt; currently
these are:

  accept/except, ate/eight, extent/extend, four/for, know/now,
  nice/mice, pray/prey, their/there, you/your, rite/right

It seems to work quite well. For there/their, for example, the rule has
a precision of 0.998 and a recall of 0.970. This means you can, on
average, use 'there' or 'their' almost 1000 times before you run into
the first false alarm (precision). For any error where you mix up
'their' and 'there', there's a 97% chance that the error will be
detected (recall). These values refer to Wikipedia and Tatoeba; some
types of text might have worse values.

Technical details: the rule is EnglishConfusionProbabilityRule and it's
part of the LT download, but the data is not and won't be, because it's
more than 6GB. More documentation can be found at
http://wiki.languagetool.org/finding-errors-using-big-data

Help with making this rule better is welcome, e.g. by evaluating the
precision and recall of more word pairs and activating them in case
they have good precision (>0.995 or so). This rule can be used for
other languages if we prepare the data. See
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html for
the languages offered by Google.

Regards
 Daniel

------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
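P.S.: To illustrate the general idea behind such a confusion rule, here is a
toy sketch in Python. It is not LanguageTool's actual implementation (the real
rule is the Java class EnglishConfusionProbabilityRule backed by the Google
Books ngram data); the confusion set, trigram counts, and threshold below are
invented purely for illustration:

```python
# Toy sketch of an ngram-based confusion rule (NOT the real LT code).
# For each word in a confusion set, compare the 3-gram score of the
# sentence as written against the variant with the word swapped, and
# flag the word if the swapped variant is far more likely.

CONFUSION_SETS = {"their": "there", "there": "their"}

# Invented 3-gram occurrence counts; the real rule reads such counts
# from the >6GB ngram data set.
TRIGRAM_COUNTS = {
    ("to", "go", "there"): 9000,
    ("to", "go", "their"): 3,
    ("go", "there", "."): 5000,
    ("go", "their", "."): 2,
}

def score(tokens):
    """Sum of the counts of all 3-grams in the token list (unseen = 1)."""
    return sum(TRIGRAM_COUNTS.get(tuple(tokens[i:i + 3]), 1)
               for i in range(len(tokens) - 2))

def find_confusions(tokens, factor=10):
    """Return (position, suggestion) pairs where the swapped variant is
    at least `factor` times more likely than the text as written."""
    hits = []
    for i, tok in enumerate(tokens):
        alt = CONFUSION_SETS.get(tok)
        if alt is None:
            continue
        variant = tokens[:i] + [alt] + tokens[i + 1:]
        if score(variant) >= factor * score(tokens):
            hits.append((i, alt))
    return hits

print(find_confusions(["remember", "how", "to", "go", "their", "."]))
```

The `factor` threshold is what trades recall against precision: requiring the
alternative to be much more frequent than the written form keeps false alarms
rare, which is why only pairs evaluated at >0.995 precision get activated.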