Hi all, I have a few questions about the quality of training and tuning. Any clarifications would be much appreciated! :-)
1/ According to the documentation: "sentences longer than 100 words (and their corresponding translations) have to be eliminated (note that a shorter sentence length limit will speed up training)". Is this only for the sake of training speed, or can overly long sentences actually hurt MT quality? In other words, when I finally train "for real usage", should I still remove long sentences?

2/ My data comes from real crowd-sourced translations, so it contains some duplicates (same source text and same translation). For training, does this not matter, should duplicates be removed, or is it actually better to keep them? I would imagine the latter (keeping duplicates) is best: since this is statistical machine learning, duplicates reflect "real life" frequency (text we encounter often and apparently usually translate the same way), so it would be good to reinforce these translations during training. Am I right?

3/ Do training and tuning data necessarily have to be different? I assume they should, for tuning to be meaningful, and various examples on the website seem to go that way, but I could not find anything stating this clearly.

(For concreteness, the P.S. below sketches the preprocessing I have in mind.)

Thanks.
Jehan
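P.S. To make the questions concrete, here is a minimal Python sketch of the preprocessing I have in mind. The file names, the 100-word limit, and the tuning fraction are placeholder assumptions on my part, not anything prescribed by Moses:

    import random

    MAX_LEN = 100          # word limit mentioned in the documentation
    TUNE_FRACTION = 0.01   # held-out fraction for tuning (arbitrary example)

    # "corpus.fr"/"corpus.en" are placeholder names for a sentence-aligned
    # parallel corpus, one sentence per line.
    with open("corpus.fr") as f_src, open("corpus.en") as f_tgt:
        pairs = list(zip(f_src, f_tgt))

    # Q1: drop pairs where either side exceeds the length limit.
    pairs = [(s, t) for s, t in pairs
             if len(s.split()) <= MAX_LEN and len(t.split()) <= MAX_LEN]

    # Q2: optional deduplication; exactly the step I am unsure about.
    # pairs = list(dict.fromkeys(pairs))  # keep only the first occurrence

    # Q3: hold out a disjoint tuning set so training and tuning never overlap.
    random.seed(42)
    random.shuffle(pairs)
    cut = int(len(pairs) * TUNE_FRACTION)
    tune, train = pairs[:cut], pairs[cut:]

    for name, subset in (("tune", tune), ("train", train)):
        with open(name + ".fr", "w") as o_src, open(name + ".en", "w") as o_tgt:
            for s, t in subset:
                o_src.write(s)  # lines already end with "\n"
                o_tgt.write(t)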
