Indeed, I fully agree with the point about understanding the limits. In
fact, in some multi-reference corpora I have observed variations of more
than 10 BLEU points when computing inter-reference BLEU scores (i.e., one
reference against the other references). However, this issue is much
broader, and would lead us to abandon all research on MT and focus on MT
evaluation.
That said, IMHO the question is actually what we intend to evaluate. In
the case of an MT evaluation campaign, with "final" systems, hypothesis
testing should be ok (and might actually be the only option). If we intend
to evaluate a new word alignment, then optimizer instability should be
taken into account (although we all know that the complete pipeline
between word alignments and final BLEU is difficult to predict). In fact,
in my current experiments I tend to bootstrap the dev set 10 times and
re-run the optimizer, although this is very costly for every experiment.
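
For anyone wanting to try the same thing, here is a minimal sketch of the resampling step only, assuming the dev set is held as two parallel lists of sentences. The function name `bootstrap_dev_sets` and its parameters are hypothetical, not part of Moses; writing each sample to disk and re-running the optimizer on it is left out.

```python
import random


def bootstrap_dev_sets(source, references, n_samples=10, seed=0):
    """Yield n_samples dev sets resampled with replacement.

    source and references are parallel lists (one entry per sentence).
    Each yielded pair has the same size as the original dev set, and
    every source sentence stays paired with its own reference.
    """
    if len(source) != len(references):
        raise ValueError("source and references must have the same length")
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    n = len(source)
    for _ in range(n_samples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        yield [source[i] for i in idx], [references[i] for i in idx]
```

Each resampled set would then be tuned on separately, so the spread of the resulting test scores gives a rough idea of how much of a difference is due to tuning alone.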
And, thanks for the reference, looks very interesting :)
Cheers,
Germán
On Thu, 24 Jan 2013, Chris Dyer wrote:
If you're interested in statistical significance testing, you really
ought to read the Clark et al. (2011) paper
(http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf). We showed that
the Koehn technique and related methods can indicate significance for
reasons that have little to do with the experimental manipulation that
is being tested--in particular, each time MERT (or virtually any other
optimizer) is run, you get a different system out, and these
differences can be "significant". With a bit more work, it is possible
to control for these effects, but there is no easy fix for the
statistical reliability problem in MT in general. We are
experimenting on top of a very unstable foundation. When it's
practical, hypothesis testing can help, but it is more important that
we, as a field, understand the limits of what it can do.
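
As a rough illustration of controlling for optimizer variance, one simple option is to run the optimizer several times per system and compare the difference in mean score against the run-to-run spread. This is only a sketch under that assumption and is not necessarily the procedure used in the paper; the function names are hypothetical and the scores are supplied by the caller.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two optimizer runs per system")
    return mean(scores), stdev(scores)


def compare_systems(baseline_scores, system_scores):
    """Compare two systems, each scored over several optimizer runs.

    Returns the difference in mean score together with the larger of the
    two run-to-run standard deviations, so the difference can be read
    against the variation that comes from the optimizer alone.
    """
    b_mean, b_sd = summarize_runs(baseline_scores)
    s_mean, s_sd = summarize_runs(system_scores)
    return {
        "baseline_mean": b_mean,
        "system_mean": s_mean,
        "delta": s_mean - b_mean,
        "max_run_sd": max(b_sd, s_sd),
    }
```

A delta that is not clearly larger than `max_run_sd` is a hint that the apparent gain may come from optimizer noise rather than from the change being tested.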
Best,
Chris
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support