Re: [Moses-support] statistical significance tests

Germán Sanchis Trilles Thu, 24 Jan 2013 05:01:27 -0800

Indeed, I fully agree with the point about understanding the limits. In fact, in some multi-reference corpora I have observed variations of more than 10 BLEU points when computing inter-reference BLEU scores (i.e., one reference against the other references). However, this issue is much broader, and would lead us to abandon all research on MT and focus on MT evaluation.

That said, IMHO the question is actually what we intend to evaluate. In the case of an MT evaluation campaign, with "final" systems, hypothesis testing should be ok (and might actually be the only option). If we intend to evaluate a new word alignment, then optimizer instability should be taken into account (although we all know that the complete pipeline between word alignments and final BLEU is difficult to predict). In fact, in my current experiments I tend to bootstrap the dev set 10 times and re-run the optimizer on each resampled set, although this is very costly for every experiment.
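
A minimal sketch of the resampling step, assuming the dev set is held as a list of source sentences with one aligned reference each. The function name `bootstrap_dev_sets` and its parameters are hypothetical, chosen only for illustration; re-running the optimizer (e.g. MERT) on each replica is not shown.

```python
import random


def bootstrap_dev_sets(src_lines, ref_lines, n_replicas=10, seed=0):
    """Return n_replicas resampled (source, reference) dev sets.

    Each replica draws len(src_lines) sentence pairs with replacement,
    keeping every source sentence paired with its own reference.
    """
    if len(src_lines) != len(ref_lines):
        raise ValueError("source and reference must have the same length")
    rng = random.Random(seed)  # fixed seed so the replicas are reproducible
    n = len(src_lines)
    replicas = []
    for _ in range(n_replicas):
        # Sample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        replicas.append(([src_lines[i] for i in idx],
                         [ref_lines[i] for i in idx]))
    return replicas
```

Each replica would then be written out and used as the tuning set for one optimizer run, so that the spread of the resulting test scores gives a rough picture of how much of an observed difference is due to tuning alone.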


And, thanks for the reference, looks very interesting :)

Cheers,

Germán



On Thu, 24 Jan 2013, Chris Dyer wrote:

If you're interested in statistical significance testing, you really
ought to read the Clark et al. (2011) paper
(http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf). We showed that
the Koehn technique and related methods can indicate significance for
reasons that have little to do with the experimental manipulation that
is being tested--in particular, each time MERT (or virtually any other
optimizer) is run, you get a different system out, and these
differences can be "significant". With a bit more work, it is possible
to control for these effects, but there is no easy fix for the
statistical reliability problem in MT in general. We are
experimenting on top of a very unstable foundation. When it's
practical, hypothesis testing can help, but it is more important that
we, as a field, understand the limits of what it can do.
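
For readers unfamiliar with it, the "Koehn technique" is usually taken to
mean paired bootstrap resampling over the test set. A minimal sketch,
assuming each system has a per-sentence score that can simply be summed
(an assumption made for brevity: BLEU is a corpus-level metric and would
have to be recomputed on every resampled set). The function name
`paired_bootstrap` and its parameters are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    scores_a and scores_b are per-sentence scores for the same test
    sentences, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement; use the SAME
        # indices for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

Note that this compares one output per system, so it only captures
variation due to the choice of test sentences, not variation between
optimizer runs.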
Best,
Chris

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
