Hi group,

I need to compute statistical significance between a pair of system outputs,
and I've used the bootstrap resampling script in Moses. Unfortunately, the
BLEU scores from this script differ substantially from (about 1.5 points
lower than) those of the standard mteval script. I've also tried applying the
same text normalization routine from mteval in the bootstrap resampling script
(and modified the script a bit so that it normalizes both hyps and refs), but
the scores are still different.

The problem is that the Moses bootstrap script reports some system outputs
as statistically significantly better than a baseline (with an absolute BLEU
difference of 0.3), but the mteval BLEU score difference between those
systems is only 0.1.
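
For reference, here is a minimal sketch of paired bootstrap resampling as I
understand it. It is not the actual code of the Moses script or of mteval; the
function names, the single-reference setup, and the whitespace tokenization are
all assumptions made for illustration. It assumes hyps and refs have already
been normalized the same way, so the score used for the test is the same score
that would be reported.

```python
import math
import random
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(hyp, ref, max_n=4):
    """Per-sentence BLEU statistics: [hyp_len, ref_len, match_1, total_1, ...].
    Assumes one reference and already-normalized, whitespace-tokenized text."""
    h, r = hyp.split(), ref.split()
    stats = [len(h), len(r)]
    for n in range(1, max_n + 1):
        hn, rn = ngrams(h, n), ngrams(r, n)
        stats.append(sum((hn & rn).values()))   # clipped n-gram matches
        stats.append(max(len(h) - n + 1, 0))    # n-grams in the hypothesis
    return stats

def corpus_bleu(stats_list):
    """Corpus-level BLEU (no smoothing) from summed per-sentence statistics."""
    totals = [sum(col) for col in zip(*stats_list)]
    hyp_len, ref_len = totals[0], totals[1]
    if hyp_len == 0:
        return 0.0
    n_orders = (len(totals) - 2) // 2
    log_prec = 0.0
    for k in range(n_orders):
        match, total = totals[2 + 2 * k], totals[3 + 2 * k]
        if match == 0 or total == 0:
            return 0.0
        log_prec += math.log(match / total)
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)  # brevity penalty, in log space
    return math.exp(log_bp + log_prec / n_orders)

def paired_bootstrap(stats_a, stats_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores above system B.
    Both systems are scored on the same resampled sentence indices."""
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        bleu_a = corpus_bleu([stats_a[i] for i in idx])
        bleu_b = corpus_bleu([stats_b[i] for i in idx])
        if bleu_a > bleu_b:
            wins += 1
    return wins / n_samples
```

With per-sentence statistics computed once for each system, a result such as
paired_bootstrap(stats_a, stats_b) >= 0.95 would be read as system A beating
system B on at least 95% of the resampled test sets. The 0.95 threshold is a
conventional choice, not something taken from either script.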

I know multeval is an option, but again the scores are different and it
doesn't do normalization. Any suggestions?

Thanks
- Baskaran