Hi all, I have an implementation of Koehn's EMNLP 2004 paper on statistical significance tests for MT evaluation. It implements both "stand-alone confidence intervals" (sec. 5, bootstrap resampling) and, if a baseline is given, paired bootstrap resampling. Right now it computes confidence intervals for both TER and BLEU (including the brevity penalty) using modified versions of multi-bleu.perl and tercom.jar, which are packaged into the script itself, so the resampling is performed on the TER and BLEU counts (rather than on the sentences themselves, which is extremely costly). I have been using it for some years now, so it should be relatively robust. Note that it performs bootstrap resampling for a given set of translations, i.e., it does not take optimizer instability into account.
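To make the idea of resampling over counts concrete, here is a minimal Python sketch. It is not the script described above: the function names, the layout of the per-sentence statistics (hypothesis length, reference length, then matched and total n-gram counts for n = 1..4) and the default of 1000 resamples are all assumptions made for illustration, and it covers BLEU only, not TER.

```python
import math
import random

def bleu_from_counts(stats):
    """Corpus BLEU from summed counts.

    Assumed layout: [hyp_len, ref_len, match_1, total_1, ..., match_4, total_4].
    """
    hyp_len, ref_len = stats[0], stats[1]
    if hyp_len == 0:
        return 0.0
    log_prec = 0.0
    for n in range(4):
        match, total = stats[2 + 2 * n], stats[3 + 2 * n]
        if match == 0 or total == 0:
            return 0.0
        log_prec += math.log(match / total) / 4
    # Brevity penalty in log space: 0 if the hypothesis is long enough.
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_prec + log_bp)

def sum_stats(rows, indices):
    """Add up the per-sentence count vectors selected by `indices`."""
    totals = [0] * len(rows[0])
    for i in indices:
        for k, value in enumerate(rows[i]):
            totals[k] += value
    return totals

def bootstrap_interval(rows, n_samples=1000, alpha=0.05, seed=0):
    """Approximate (1 - alpha) confidence interval for corpus BLEU."""
    rng = random.Random(seed)
    n = len(rows)
    scores = sorted(
        bleu_from_counts(sum_stats(rows, [rng.randrange(n) for _ in range(n)]))
        for _ in range(n_samples)
    )
    low = scores[int(n_samples * alpha / 2)]
    high = scores[min(n_samples - 1, int(n_samples * (1 - alpha / 2)))]
    return low, high

def paired_bootstrap(sys_rows, base_rows, n_samples=1000, seed=0):
    """Fraction of resamples in which the system strictly beats the baseline."""
    assert len(sys_rows) == len(base_rows)
    rng = random.Random(seed)
    n = len(sys_rows)
    wins = 0
    for _ in range(n_samples):
        # The same sentence indices are drawn for both systems (paired test).
        idx = [rng.randrange(n) for _ in range(n)]
        if bleu_from_counts(sum_stats(sys_rows, idx)) > bleu_from_counts(
            sum_stats(base_rows, idx)
        ):
            wins += 1
    return wins / n_samples
```

Because each resample only re-adds precomputed count vectors, the n-gram matching is done once per sentence and never repeated. In the paired case, a win fraction of 0.95 or more would correspond to the usual 95% significance threshold.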
If it is of any interest to the Moses project, I have no problem whatsoever donating it to the MT community ;)
Cheers, Germán
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
