saeed smith <saeed.smith.1@...> writes:

> Thank you all (especially for the paper Chris mentioned). I agree with you,
> Barry. But as Germán said, when the optimizer is not involved in experiments
> (e.g. evaluating decoder modifications), the tool can be very useful. Am I
> missing something?

I guess the point is that even if you can reject the null hypothesis (e.g. that
the score difference is caused by optimizer or test set randomness), this
doesn't mean that your results are meaningful. One problem is that MT metrics
still don't correlate very well with human judgment. Another is that since
increasing the sample size (the test set) is very cheap, it's easy to increase
its size to a point where you can reject your null hypothesis, even if the
differences in score are small. A third problem is that differences can be
non-random and big, but not tell you anything meaningful about the part of the
system you're modifying. 
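
To make the sample-size point concrete, here is a minimal sketch in Python. It assumes a paired test on per-sentence score differences under a normal approximation; the mean difference (0.001) and standard deviation (0.05) are invented for illustration and do not come from any experiment. The same tiny difference is not significant on 1,000 sentences but becomes significant once the test set is large enough.

```python
import math

def paired_z(mean_diff, std_diff, n):
    """z statistic for a paired test: mean difference over its standard error."""
    return mean_diff / (std_diff / math.sqrt(n))

def two_sided_p(z):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: a per-sentence gain of 0.001 with standard deviation 0.05.
for n in (1000, 10000, 100000):
    z = paired_z(0.001, 0.05, n)
    print(n, round(z, 2), "significant" if two_sided_p(z) < 0.05 else "not significant")
```

The z statistic grows with the square root of the test set size, so only the effect size, not the p-value, tells you whether the difference matters.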

I actually had this problem playing around with [the Moses implementation of]
PRO: adding some features made my system consistently better, adding others made
it consistently worse compared to the baseline. The latter should never happen
with a perfect optimizer (I'm talking about dev set scores, so overfitting is
not the explanation): by copying the weights from the baseline and setting the
weights of the additional features to 0, you can replicate the baseline results
with the new system. So if I add a feature and the optimizer finds a worse local optimum, can
I conclude that the feature is bad? No, because optimizer instability is at the
root of the difference, even if it is non-random.
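
The zero-weight argument can be sketched in a few lines of Python. This assumes the usual linear model, where a hypothesis is scored by the dot product of weights and feature values; the weights and feature values below are made up for illustration.

```python
def model_score(weights, features):
    """Linear model score: dot product of weights and feature values."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical baseline weights and feature values for one hypothesis.
baseline_weights = [0.5, -0.2, 0.3]
baseline_features = [1.0, 4.0, 2.0]

# Extended system: the same features plus one new feature whose weight is 0.
extended_weights = baseline_weights + [0.0]
extended_features = baseline_features + [7.5]

# The new feature contributes nothing, so the score is unchanged.
assert model_score(extended_weights, extended_features) == \
    model_score(baseline_weights, baseline_features)
```

Since every hypothesis keeps its baseline score, the baseline weight vector is always a point in the new system's search space, and a perfect optimizer could never end up below it on the dev set.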

However, I think the discussion of whether to add a significance test to Moses
comes a bit late, since I know of at least one such script that has been in
there for years:

mosesdecoder/scripts/analysis/bootstrap-hypothesis-difference-significance.pl
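
For readers unfamiliar with the general idea, below is a minimal sketch of paired bootstrap resampling in Python. It is not a description of what the script above does internally. As a simplifying assumption it compares sums of per-sentence scores rather than recomputing a corpus-level metric such as BLEU, and the function name and parameters are my own.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system B's total score beats A's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n sentence indices with replacement; use the same ones for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / samples
```

If B wins on, say, at least 95% of the resampled sets, the difference is usually called significant at that level, with all the caveats above still applying.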

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
