Hi,

a rule that uses ngram occurrence data to detect errors in English text 
has now been activated on languagetool.org. Here are some errors it can 
detect which LT couldn't detect before:

   I can't remember how to go their.
   I didn't now where it came from.
   Alabama has for of the world's largest stadiums.

The supported word pairs are listed in confusion_sets.txt; currently 
these are:

accept, except
ate, eight
extent, extend
four, for
know, now
nice, mice
pray, prey
their, there
you, your
rite, right
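
A minimal sketch of how such a pair list can be loaded, assuming each 
line holds a comma-separated word pair as shown above (the actual 
format of confusion_sets.txt may differ):

```python
# Load comma-separated confusion pairs into a lookup table.
# The file format is an assumption, mirroring the pairs listed above.

def load_confusion_pairs(lines):
    """Map each word to the word(s) it can be confused with."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words = [w.strip() for w in line.split(",")]
        for w in words:
            pairs[w] = [other for other in words if other != w]
    return pairs

pairs = load_confusion_pairs(["their, there", "four, for"])
print(pairs["their"])  # ['there']
```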

It seems to work quite well: for there/their, for example, the rule has 
a precision of 0.998 and a recall of 0.970. This means that of the 
errors the rule reports, on average only about 1 in 500 is a false 
alarm (precision), and for any error where you mix up 'their' and 
'there', there's a 97% chance that it will be detected (recall). These 
values were measured on Wikipedia and Tatoeba; some types of text might 
yield worse results.
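
As a worked example of what those two numbers mean (the raw counts 
below are hypothetical; only the resulting precision and recall values 
come from the evaluation):

```python
# Hypothetical evaluation counts chosen to reproduce the quoted values.
true_positives = 970    # real their/there errors that were flagged
false_negatives = 30    # real errors the rule missed
false_positives = 2     # correct usages wrongly flagged

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(round(precision, 3), round(recall, 3))  # 0.998 0.97
```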

Technical details: the rule is EnglishConfusionProbabilityRule and it's 
part of the LT download, but the data is not and won't be, because it's 
more than 6GB.
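
The core idea can be sketched roughly as follows. This is not the 
actual EnglishConfusionProbabilityRule implementation, just a toy 
illustration: for each word of a confusion pair, look up how often the 
surrounding ngram occurs in the data and prefer the variant with the 
higher count.

```python
# Toy stand-in for the multi-gigabyte ngram index; real counts would
# come from the Google Books ngram data.
NGRAM_COUNTS = {
    ("to", "go", "there"): 120000,
    ("to", "go", "their"): 900,
}

def better_variant(left_context, word, alternative):
    """Return whichever variant forms the more frequent 3-gram."""
    count_word = NGRAM_COUNTS.get(left_context + (word,), 0)
    count_alt = NGRAM_COUNTS.get(left_context + (alternative,), 0)
    return alternative if count_alt > count_word else word

print(better_variant(("to", "go"), "their", "there"))  # there
```

A real rule would of course also apply a threshold so that it only 
reports an error when the counts differ clearly, which is what keeps 
the precision high.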

More documentation can be found at 
http://wiki.languagetool.org/finding-errors-using-big-data

Help with making this rule better is welcome, e.g. by evaluating the 
precision and recall of more word pairs and activating them if they 
have good precision (>0.995 or so). This rule can be used for other 
languages if we prepare the data. See 
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html for the 
languages offered by Google.

Regards
  Daniel


------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel