Hi group, I need to compute statistical significance between a pair of system outputs, and I've used the bootstrap resampling script in Moses. Unfortunately, the BLEU scores from this script differ substantially from those of the standard mteval script (they are about 1.5 points lower). I've also tried applying the same text normalization routine from mteval in the bootstrap resampling script (I modified the script a bit so that it normalizes both hyps and refs), but the scores are still different.
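For concreteness, here is a minimal sketch of paired bootstrap resampling over per-sentence BLEU statistics. It is written from scratch for illustration and is neither the Moses script nor mteval. The function names (`sentence_stats`, `corpus_bleu`, `paired_bootstrap`), the single-reference setup, and the plain whitespace tokenization are all assumptions. Any normalization would have to be applied to hyps and refs before they are passed in.

```python
import math
import random
from collections import Counter

MAX_N = 4  # BLEU-4


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_stats(hyp, ref):
    """Per-sentence sufficient statistics for BLEU (single reference assumed).

    Returns [hyp_len, ref_len, match_1, total_1, ..., match_4, total_4].
    Tokenization is plain whitespace splitting (an assumption).
    """
    h, r = hyp.split(), ref.split()
    stats = [len(h), len(r)]
    for n in range(1, MAX_N + 1):
        hyp_ngrams, ref_ngrams = ngrams(h, n), ngrams(r, n)
        stats.append(sum((hyp_ngrams & ref_ngrams).values()))  # clipped matches
        stats.append(max(len(h) - n + 1, 0))                   # hyp n-gram count
    return stats


def corpus_bleu(stats_list):
    """Corpus-level BLEU from summed sentence statistics (no smoothing)."""
    total = [sum(column) for column in zip(*stats_list)]
    hyp_len, ref_len = total[0], total[1]
    if hyp_len == 0:
        return 0.0
    log_precision = 0.0
    for n in range(MAX_N):
        match, count = total[2 + 2 * n], total[3 + 2 * n]
        if match == 0 or count == 0:
            return 0.0
        log_precision += math.log(match / count)
    log_brevity_penalty = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_brevity_penalty + log_precision / MAX_N)


def paired_bootstrap(hyps_a, hyps_b, refs, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    The same resampled sentence indices are used for both systems,
    which is what makes the test paired.
    """
    rng = random.Random(seed)
    stats_a = [sentence_stats(h, r) for h, r in zip(hyps_a, refs)]
    stats_b = [sentence_stats(h, r) for h, r in zip(hyps_b, refs)]
    size = len(refs)
    wins_a = 0
    for _ in range(samples):
        idx = [rng.randrange(size) for _ in range(size)]
        bleu_a = corpus_bleu([stats_a[i] for i in idx])
        bleu_b = corpus_bleu([stats_b[i] for i in idx])
        if bleu_a > bleu_b:
            wins_a += 1
    return wins_a / samples
```

In this sketch the BLEU values depend entirely on how the input strings were tokenized and normalized, so significance results are only comparable with another scorer if both see identically preprocessed text.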
The problem is that the Moses bootstrap script reports some system output as statistically significantly different from a baseline (with an absolute BLEU difference of 0.3), but the mteval BLEU score difference between those systems is only 0.1. I know multeval is an option, but again the scores are different and it doesn't do normalization. Any suggestions? Thanks - Baskaran
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
