using ngram data to detect errors

Daniel Naber Wed, 16 Sep 2015 13:51:24 -0700

Hi,

some time ago, I added a rule for English that detects errors 
statistically, using large ngram data sets. I've now activated the rule 
for all languages that we have data for: Chinese, French, Italian, 
Russian, and Spanish (German has already been active for some time).


That means rule developers can add word pairs to the 
'confusion_sets.txt' file and LT will try to detect wrong usage of 
either word of the pair. Here's how you can use this approach to detect 
errors:

1.) Download the (large) data from 
http://languagetool.org/download/ngram-data/untested/ for your language
2.) Follow the documentation at 
http://wiki.languagetool.org/adding-n-gram-data-rules
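
To give a rough idea of what a word pair check does, here is a minimal
Python sketch. It is not LT's actual implementation: the trigram counts,
the function names (context_score, check_confusion_pair) and the
min_ratio threshold are all made up for illustration. The idea is to
compare how often the word that was used and its counterpart from the
pair occur in the same context, and to flag the word if the counterpart
is much more common there.

```python
from collections import Counter

# Toy trigram counts standing in for the real ngram data.
# These numbers are invented for illustration only.
TRIGRAM_COUNTS = Counter({
    ("over", "there", "is"): 120,
    ("over", "their", "is"): 1,
    ("in", "their", "house"): 90,
    ("in", "there", "house"): 2,
})


def context_score(left, word, right, counts=TRIGRAM_COUNTS):
    """Return the add-one smoothed count of the trigram (left, word, right)."""
    return counts[(left, word, right)] + 1


def check_confusion_pair(tokens, pair, min_ratio=10.0):
    """Flag uses of either word of `pair` where the other word fits better.

    Returns a list of (token_index, used_word, suggested_word) tuples.
    `min_ratio` is a hypothetical threshold: the alternative must be at
    least this many times more frequent in the same context.
    """
    a, b = pair
    findings = []
    # Only tokens with both a left and a right neighbour are checked.
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if word not in pair:
            continue
        other = b if word == a else a
        used = context_score(tokens[i - 1], word, tokens[i + 1])
        alternative = context_score(tokens[i - 1], other, tokens[i + 1])
        if alternative / used >= min_ratio:
            findings.append((i, word, other))
    return findings


wrong = check_confusion_pair("it is over their is fine".split(), ("there", "their"))
right = check_confusion_pair("we met in their house today".split(), ("there", "their"))
```

With these toy counts, the first call flags "their" and suggests "there",
while the second call finds nothing to complain about.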

This is not a general replacement for writing rules manually, but it's 
often easier and it sometimes works better. In my experience, it's hard 
to tell which word pairs work well with this approach; it's something 
one just has to experiment with.

Please give it a try and let me know if you have feedback or questions.

Regards
  Daniel

