Indeed, I fully agree with the point about understanding the limits. In fact, in some multi-reference corpora I have observed variations of more than 10 BLEU points when computing inter-reference BLEU scores (i.e., one reference against the other references). However, this issue is much broader, and would lead us to abandon all research on MT and focus on MT evaluation.
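
For concreteness, the inter-reference comparison mentioned above can be sketched roughly as follows. This is a plain, unsmoothed corpus-level BLEU written only for illustration; it is not the scorer behind the figures quoted here, and the names `corpus_bleu` and `inter_reference_bleu` are my own. Input is assumed to be already tokenized.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs_per_sent, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale (illustrative only)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hyps, refs_per_sent):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # per-n-gram maximum over references
            matches[n - 1] += sum((hyp_counts & max_ref).values())  # clipped
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)


def inter_reference_bleu(reference_sets):
    """Score each reference set against all the others (leave-one-out).

    reference_sets[k][i] is the token list for sentence i in reference k.
    """
    scores = []
    for k, held_out in enumerate(reference_sets):
        others = [rs for j, rs in enumerate(reference_sets) if j != k]
        refs_per_sent = list(zip(*others))
        scores.append(corpus_bleu(held_out, refs_per_sent))
    return scores
```

The spread between the returned scores (maximum minus minimum) gives a rough idea of how much BLEU moves just from swapping which human translation plays the role of the hypothesis.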

That said, IMHO the question is actually what we intend to evaluate. In the case of an MT evaluation campaign, with "final" systems, hypothesis testing should be ok (and might actually be the only option). If we intend to evaluate a new word alignment, then optimizer instability should be taken into account (although we all know that the complete pipeline between word alignments and final BLEU is difficult to predict). In fact, in my current experiments I tend to bootstrap the dev set 10 times and re-run the optimizer each time, although this is very costly to do for every experiment.
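
The dev-set bootstrapping described above could look roughly like the sketch below. It only shows the resampling and the aggregation; `tune_and_score` is a hypothetical callback standing in for whatever runs the optimizer (MERT or otherwise) on a resampled dev set and returns a test-set score, and all function names here are my own, not part of Moses.

```python
import random
import statistics


def bootstrap_dev_sets(dev_pairs, n_samples=10, seed=0):
    """Draw n_samples dev sets by sampling sentence pairs with replacement.

    dev_pairs is a list of (source, reference) items; each resample has the
    same size as the original dev set.
    """
    rng = random.Random(seed)
    size = len(dev_pairs)
    return [
        [dev_pairs[rng.randrange(size)] for _ in range(size)]
        for _ in range(n_samples)
    ]


def score_across_resamples(dev_pairs, tune_and_score, n_samples=10, seed=0):
    """Re-run tuning on each resampled dev set and summarize the scores.

    tune_and_score is a hypothetical callable: it takes one resampled dev
    set, runs the optimizer on it, and returns a single score (e.g. BLEU
    on a fixed test set). Needs n_samples >= 2 for the standard deviation.
    """
    scores = [
        tune_and_score(sample)
        for sample in bootstrap_dev_sets(dev_pairs, n_samples, seed)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```

Comparing two systems then means comparing two (mean, standard deviation) pairs rather than two single numbers, which is where the cost comes from: every system needs `n_samples` full tuning runs.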

And, thanks for the reference, looks very interesting :)

Cheers,

Germán



On Thu, 24 Jan 2013, Chris Dyer wrote:

If you're interested in statistical significance testing, you really
ought to read the Clark et al. (2011) paper
(http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf). We showed that
the Koehn technique and related methods can indicate significance for
reasons that have little to do with the experimental manipulation that
is being tested--in particular, each time MERT (or virtually any other
optimizer) is run, you get a different system out, and these
differences can be "significant". With a bit more work, it is possible
to control for these effects, but there is no easy fix for the
statistical reliability problem in MT in general.  We are
experimenting on top of a very unstable foundation. When it's
practical, hypothesis testing can help, but it is more important that
we, as a field, understand the limits of what it can do.
Best,
Chris
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
