Hi Jonathan,

Thanks for your reply. I usually prefer the mteval script because it lets
me retain the original reference, and it also ensures consistency across
all the systems being evaluated. With this, I just detokenize the system
output and measure lower-cased BLEU for all systems.
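To be concrete, by lower-cased BLEU I just mean lowercasing both the
hypotheses and the references before scoring, roughly as in the sketch
below. NLTK's corpus_bleu is only a stand-in for mteval's internal BLEU
here (mteval additionally applies its own NormalizeText()), and the file
names are placeholders:

    # Sketch of lower-cased BLEU: lowercase both sides before scoring.
    # NLTK's corpus_bleu stands in for mteval's BLEU; file names are
    # placeholders, not real paths.
    from nltk.translate.bleu_score import corpus_bleu

    def lowercased_bleu(hyp_lines, ref_lines):
        hyps = [h.lower().split() for h in hyp_lines]
        refs = [[r.lower().split()] for r in ref_lines]  # one ref per segment
        return 100 * corpus_bleu(refs, hyps)             # as a percentage

    with open('system.detok') as h, open('reference.txt') as r:
        print(lowercased_bleu(h.read().splitlines(), r.read().splitlines()))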
Btw, I didn't understand this:

    ... It shouldn't be a huge job to do the same with Moses' bootstrap
    resampling script, extracting its normalization as a separate step. ...

As far as I know, the bootstrap resampling script doesn't do any
normalization. At least I don't see such a version in my Moses directory,
obtained from GitHub around June '12.

Anyway, the BLEU scores are below. These scores are on the Arabic-English
MTA test set of 1313 sentences. The system was trained on the Ar-En ISI
parallel corpus (~1.1M sentence pairs) and tuned on a separate MTA tuning
set of 1664 sentences. Scores marked with * indicate a statistically
significant difference from the baseline at p = 0.01. Please see the
footnotes below for what the bracketed numbers mean.

                                   Baseline   System-1
    mteval-11b [1]                  36.06      36.16
     - w/o normalization [2,3]      35.73      35.89
    multi-bleu.pl [2]               34.15      34.52
    bootstrap resampling [2]        34.15      34.52*
     - with normalization [1,4]     34.59      34.90*
    multeval (0.4.3) [2]            32.70      33.30*

    [1] Uses the original reference with detokenized and normalized system
        output
    [2] Uses the detokenized reference with raw system output
    [3] Disables the call to the NormalizeText() method in mteval-11b
    [4] Calls the NormalizeText() method (copied from mteval-11b) as a
        pre-processing step

Any suggestions? (I've also put a sketch of the paired bootstrap procedure
and of mteval's normalization in the postscripts at the very end of this
message, below the quoted thread.)

cheers
- Baskaran

On Tue, Apr 9, 2013 at 7:36 AM, Jonathan Clark <[email protected]> wrote:

> Hi Baskaran,
>
> I've had similar issues when dealing with metric scripts that perform
> their own normalization. As a first step, you might consider performing
> normalization as a pre-processing step and disabling all normalization
> within the scripts. Michael Denkowski has a version of mteval that has
> normalization disabled:
> https://github.com/mjdenkowski/meteor/tree/master/mt-diff/files. It
> shouldn't be a huge job to do the same with Moses' bootstrap resampling
> script, extracting its normalization as a separate step. This will at
> least allow you to examine the inputs and blame either text normalization
> or the mathematics. Selfishly, I'd also like to know whether multeval's
> bootstrap resampling differs in its calculations. :)
>
> Usually, I'm not a fan of doing any normalization besides the
> tokenization inherent to the MT system, but I know sometimes this isn't
> an option if you don't have control over one of the systems involved in
> the comparison.
>
> Could you also post absolute BLEU scores? Sometimes, smoothing can make a
> difference with lower-scoring systems.
>
> Cheers,
> Jon
>
>
> On Mon, Apr 8, 2013 at 9:13 PM, Baskaran Sankaran <[email protected]> wrote:
>
>> Hi group,
>>
>> I need to compute statistical significance between a pair of system
>> outputs, and I've used the bootstrap resampling script in Moses.
>> Unfortunately, the BLEU scores from this script differ substantially
>> (about 1.5 points lower) from those of the standard mteval script. I've
>> also tried applying the same text normalization routine from mteval in
>> the bootstrap resampling script (and modified the script a bit so that
>> it would normalize both hyps and refs), but the scores are still
>> different.
>>
>> The problem is that the Moses bootstrap script suggests that a system
>> output is statistically significantly better than the baseline (with an
>> absolute BLEU difference of 0.3), while the mteval BLEU difference
>> between those systems is only 0.1.
>>
>> I know multeval is an option, but again the scores are different, and it
>> doesn't do normalization. Any suggestions?
>>
>> Thanks
>> - Baskaran
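P.S. For anyone following along, a rough sketch of the paired bootstrap
procedure (Koehn, 2004) in plain Python is below. The metric argument is a
placeholder for a corpus-level BLEU scorer; Moses' script, mteval, and
multeval each supply their own, which is exactly where the discrepancy
seems to creep in:

    import random

    def paired_bootstrap(hyp_a, hyp_b, refs, metric, n_samples=1000, seed=12345):
        """Paired bootstrap resampling (Koehn, 2004) -- a sketch.

        hyp_a, hyp_b: hypothesis sentences from the two systems
        refs:         reference sentences, in the same order
        metric:       corpus-level scorer, (hypotheses, references) -> float
        Returns the fraction of resamples on which system B beats system A.
        """
        rng = random.Random(seed)
        n = len(refs)
        b_wins = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
            sample_a = [hyp_a[i] for i in idx]
            sample_b = [hyp_b[i] for i in idx]
            sample_r = [refs[i] for i in idx]
            if metric(sample_b, sample_r) > metric(sample_a, sample_r):
                b_wins += 1
        return b_wins / n_samples

If the returned fraction is at least 0.99, you would call the difference
significant at p = 0.01. Note that the resampling itself is the easy part:
whatever normalization applies happens inside metric(), so two
implementations can agree on the procedure and still disagree on the scores.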
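P.P.S. Re footnotes [3] and [4]: below is my rough Python paraphrase of the
language-independent part of NormalizeText() in mteval-11b. I'm
reconstructing the regexes from my reading of the Perl, so please check it
against the actual script before relying on it:

    import re

    def normalize_text(s):
        # Rough paraphrase of mteval's language-independent NormalizeText();
        # verify against the Perl source before relying on it.
        s = re.sub(r'<skipped>', '', s)   # strip <skipped> tags
        s = re.sub(r'-\n', '', s)         # undo end-of-line hyphenation
        s = re.sub(r'\n', ' ', s)         # join lines
        # un-escape a few SGML entities
        s = s.replace('&quot;', '"').replace('&amp;', '&')
        s = s.replace('&lt;', '<').replace('&gt;', '>')
        s = ' ' + s + ' '
        # put spaces around most punctuation
        s = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', r' \1 ', s)
        # split period/comma unless adjacent to a digit
        s = re.sub(r'([^0-9])([\.,])', r'\1 \2 ', s)
        s = re.sub(r'([\.,])([^0-9])', r' \1 \2', s)
        # split a dash preceded by a digit
        s = re.sub(r'([0-9])(-)', r'\1 \2 ', s)
        return re.sub(r'\s+', ' ', s).strip()  # squeeze whitespace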
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
