Hi all,

I have an implementation of Koehn's 2004 EMNLP paper on statistical significance tests for MT evaluation. It implements both "stand-alone confidence intervals" (sec.5, bootstrap resampling) and, if a baseline is given, paired bootstrap resampling. Right now it computes confidence intervals for both TER and BLEU (including the brevity penalty) using modified versions of multi-bleu.perl and tercom.jar, which are packaged into the script itself. The resampling is performed on the per-sentence TER and BLEU counts rather than on the sentences themselves, since re-scoring every resampled set of sentences is extremely costly. I have been using it for some years now, so it should be relatively robust. Note that it implements bootstrap resampling for a given set of translations, i.e., it does not take optimizer instability into account.
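
To illustrate the idea of resampling on counts, here is a minimal Python sketch. It is not the script described above; the per-sentence layout of the counts (matched and total n-grams for n = 1..4, then hypothesis and reference length) and all function names are assumptions made for this example. TER could be handled the same way by summing per-sentence edit counts and reference lengths.

```python
import math
import random

# Assumed per-sentence layout (hypothetical, for illustration only):
# [match1, total1, match2, total2, match3, total3, match4, total4, hyp_len, ref_len]

def bleu_from_stats(stats):
    """Corpus BLEU (with brevity penalty) from summed n-gram counts."""
    matches = stats[0:8:2]
    totals = stats[1:8:2]
    hyp_len, ref_len = stats[8], stats[9]
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)  # log of the brevity penalty
    return math.exp(log_prec + log_bp)

def sum_stats(rows, indices):
    """Add up the count rows of the sampled sentences."""
    total = [0] * 10
    for i in indices:
        for k, value in enumerate(rows[i]):
            total[k] += value
    return total

def bootstrap_interval(rows, n_samples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for BLEU of one system."""
    rng = random.Random(seed)
    n = len(rows)
    scores = []
    for _ in range(n_samples):
        indices = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        scores.append(bleu_from_stats(sum_stats(rows, indices)))
    scores.sort()
    low = scores[int(n_samples * alpha / 2)]
    high = scores[int(n_samples * (1 - alpha / 2)) - 1]
    return low, high

def paired_bootstrap(sys_rows, base_rows, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which the system beats the baseline."""
    rng = random.Random(seed)
    n = len(sys_rows)
    wins = 0
    for _ in range(n_samples):
        # The same sentence indices are used for both systems (paired test).
        indices = [rng.randrange(n) for _ in range(n)]
        sys_bleu = bleu_from_stats(sum_stats(sys_rows, indices))
        base_bleu = bleu_from_stats(sum_stats(base_rows, indices))
        if sys_bleu > base_bleu:
            wins += 1
    return wins / n_samples
```

Since each resampled score only requires adding up small integer vectors, thousands of bootstrap samples are cheap compared to re-running the scorers on the sentences each time.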

If it is of any interest to the Moses project, I have no problem whatsoever donating it to the MT community ;)

Cheers,

Germán
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
