Hi all,

I have a few questions about quality of training and tuning. If anyone
has any clarifications, that would be nice! :-)

1/ According to the documentation:
«
sentences longer than 100 words (and their corresponding translations)
have to be eliminated
   (note that a shorter sentence length limit will speed up training
»
So is this only for the sake of training speed, or can overly long sentences
also end up being a liability for MT quality? In other words, when I finally
train "for real usage", should I really remove the long sentences?
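For reference, this kind of length filtering can be sketched as follows. This is only an illustration, not the Moses cleaning script itself; the whitespace tokenization and the 100-token limit are assumptions taken from the quoted documentation:

```python
# Minimal sketch: drop sentence pairs where either side exceeds a token
# limit. Whitespace tokenization is an assumption for illustration; a real
# pipeline would tokenize first.

def filter_long_pairs(src_lines, tgt_lines, max_len=100):
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept
```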

2/ My data comes from real crowd-sourced translations. As a
consequence, we end up with some duplicates (same source text and
same translation). For training, I wonder whether duplicates simply
don't matter, whether we should remove them, or whether it is actually
better to keep them.

I would imagine the latter (keeping duplicates) is best, since this is
"statistical machine learning" and, after all, these are "real life"
duplicates (text we encounter often and apparently usually translate
the same way), so it would be good to give extra weight to these
translations during training.
Am I right?
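In case removal does turn out to be the right call, deduplicating a parallel corpus can be sketched like this (exact string matching on the pairs is an assumption; near-duplicates would need normalization first):

```python
# Minimal sketch: remove exact duplicate (source, target) pairs while
# preserving the order of first occurrences.

def dedupe_pairs(pairs):
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```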

3/ Do training and tuning data necessarily have to be different? I
guess they should, for tuning to be meaningful, and various examples
on the website seem to point that way, but I could not find anything
clearly stating this.

Thanks.

Jehan

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
