saeed smith <saeed.smith.1@...> writes:
> > Thank you all (especially for the paper Chris mentioned). I agree with you, Barry. But as Germán said, when the optimizer is not involved in the experiments (e.g. evaluating decoder modifications), the tool can be very useful. Am I missing something?
I guess the point is that even if you can reject the null hypothesis (e.g. that the score difference is caused by optimizer or test set randomness), this doesn't mean that your results are meaningful.

One problem is that MT metrics still don't correlate very well with human judgment. Another is that, since increasing the sample size (the test set) is very cheap, it's easy to grow the test set to the point where you can reject your null hypothesis even if the differences in score are small. A third problem is that differences can be non-random and big, yet tell you nothing meaningful about the part of the system you're modifying.

I actually had this problem playing around with [the Moses implementation of] PRO: adding some features made my system consistently better, while adding others made it consistently worse than the baseline. The latter should never happen with a perfect optimizer (I'm talking about dev set scores, so there was no overfitting): by copying the weights from the baseline and setting the weights of the additional features to 0, you can replicate the baseline results with the new system. So if I add a feature and the optimizer finds a worse local optimum, can I conclude that the feature is bad? No, because optimizer instability is at the root of the difference, even if it is non-random.

However, I think the discussion of whether to add a significance test to Moses comes a bit late, since I know of at least one such script that has been in there for years:

mosesdecoder/scripts/analysis/bootstrap-hypothesis-difference-significance.pl
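
For anyone who hasn't seen this kind of test, here is a minimal sketch of paired bootstrap resampling. It is not a port of the Perl script mentioned above, and the function name `paired_bootstrap` and its parameters are made up for illustration. It also assumes you have per-sentence scores that can simply be summed, which is a simplification: BLEU is a corpus-level statistic, so a real test would recompute the metric on each resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system B outscores system A.

    scores_a and scores_b are per-sentence scores for the same test
    sentences in the same order (hypothetical inputs; summing sentence
    scores stands in for recomputing a corpus-level metric).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    if not scores_a:
        raise ValueError("need at least one sentence")

    rng = random.Random(seed)
    n = len(scores_a)
    b_wins = 0
    for _ in range(n_resamples):
        # Sample sentence indices with replacement, and use the SAME
        # indices for both systems: that is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_b > total_a:
            b_wins += 1
    return b_wins / n_resamples


if __name__ == "__main__":
    # Toy numbers, invented for the example.
    baseline = [0.20, 0.35, 0.10, 0.42, 0.31]
    modified = [0.22, 0.33, 0.15, 0.45, 0.30]
    print(paired_bootstrap(baseline, modified))
```

A fraction close to 1 only says that the difference survives resampling of the test set. It says nothing about whether the metric agrees with human judgment, and it cannot separate the effect of a new feature from the effect of the optimizer landing in a different local optimum.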
