French: detecting errors with statistics

Daniel Naber Tue, 29 Mar 2016 14:09:18 -0700

Hi,

even though I don't speak French, I've started adding confusion pairs 
for French. Here's an example from fr/confusion_sets.txt:


quand; quant; 1000000    # p=1.000, r=0.662, 186+988, 3grams, 2016-03-29

This means that whenever 'quand' appears, LT uses Google ngrams[1] to 
check whether 'quant' is more probable in that context, and vice versa. 
'1000000' is a factor to avoid false alarms. p=1.000, r=0.662 means: 
with my evaluation set, this pair has a precision of 1, i.e. it doesn't 
produce any false alarms, and a recall of 0.662, i.e. 66.2% of all 
errors are detected.
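
To make the format concrete, here is a minimal Python sketch (not LT's actual Java code) that parses such a line and applies the factor. The names ConfusionPair, parse_confusion_line and should_flag are hypothetical, the probabilities are made up, and the decision rule (flag only if the alternative is at least 'factor' times more probable than the written word) is my assumption about how the factor suppresses false alarms:

```python
from dataclasses import dataclass


@dataclass
class ConfusionPair:
    word1: str
    word2: str
    factor: int


def parse_confusion_line(line: str) -> ConfusionPair:
    """Parse 'word1; word2; factor  # comment' (hypothetical helper)."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    word1, word2, factor = (part.strip() for part in data.split(";"))
    return ConfusionPair(word1, word2, int(factor))


def should_flag(prob_written: float, prob_alternative: float, factor: int) -> bool:
    """Assumed rule: flag only if the alternative is at least
    `factor` times more probable than the word actually written."""
    return prob_alternative >= prob_written * factor


pair = parse_confusion_line(
    "quand; quant; 1000000    # p=1.000, r=0.662, 186+988, 3grams, 2016-03-29"
)

# Made-up probabilities, for illustration only.
clear_error = should_flag(1e-12, 1e-5, pair.factor)
borderline = should_flag(1e-8, 1e-5, pair.factor)
```

With these invented numbers, only the first case is flagged: a larger factor means fewer alarms, at the cost of missing some real errors.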

So far, there are only 9 pairs like this (pris/prix, don/donc, dans/dent 
etc.) but I'm going to add more. I'll do the same for Spanish. Feel free 
to also add pairs. You can check how well a pair works (and find a good 
factor with a low false alarm rate) using ConfusionRuleEvaluator from 
the languagetool-dev module.
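
For anyone unsure how to read the p and r values when comparing factors, this is the usual arithmetic as a small Python sketch. It is not the ConfusionRuleEvaluator code, the function name is hypothetical, and the counts are invented for illustration:

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from evaluation counts.

    precision = correct alarms / all alarms raised
    recall    = correct alarms / all real errors in the evaluation set
    A ratio with a zero denominator is reported as 0.0 here.
    """
    flagged = true_positives + false_positives
    real_errors = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / real_errors if real_errors else 0.0
    return precision, recall


# Invented counts: 3 real errors found, 1 false alarm, 1 real error missed.
p, r = precision_recall(true_positives=3, false_positives=1, false_negatives=1)
```

A higher factor typically moves p towards 1 while lowering r, so a good factor is one where p is close to 1 and r is still acceptable.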

Regards
  Daniel

[1] http://wiki.languagetool.org/finding-errors-using-n-gram-data


