Hi Vincent,

On Thu, 2015-09-24 at 22:37 +0200, Vincent Nguyen wrote:
> Thanks Matthias for the detailed explanation.
> I think I have most of it in mind except not really understanding how
> this one works:
>
> "Difficult sentences generally have worse model score than easy ones but
> may still be useful for training."
Well, your data selection method may discard training instances that are
somehow hard to decode, e.g. because of complex sentence structure or
because of rare vocabulary. But that doesn't necessarily mean that the
sentence pairs you're removing are bad ones. You should manually inspect
some samples if possible.

I didn't try it, but I suspect that you'd get a higher decoder score on the
1-best decoder output for the first of the following two input sentences:

(1) " Merci ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! "
(2) " Je l' ai vécu moi-même en personne quand j' ai eu mon diplôme à Barnard College en 2002 . "

(Just a simple made-up example.) If we assume that you have a correct
English target sentence for both of those sentences in your training data,
which of the two could you learn more from?

If you're doing what I think you are, then you're basically just assessing
whether the source side of the sentence pair is easy to translate. Does
this tell you anything about the target sentence? If your data is noisy,
the target side might be misaligned or in a different, third language.

Cheers,
Matthias
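
P.S. For concreteness, here's a minimal sketch of the kind of score-based
selection I assume you're doing: keep a sentence pair if the decoder's model
score for its source side clears a threshold. The tab-separated input format,
the keep() helper, and the threshold value are all my assumptions, not
anything you've described; treat it as an illustration of the pitfall above
rather than a recipe.

    # Hypothetical sketch: filter parallel data on the decoder score of the
    # source sentence's 1-best translation. Input on stdin, one pair per line:
    #   source<TAB>target<TAB>decoder_score
    import sys

    THRESHOLD = -50.0  # made-up cutoff; raw model scores grow more negative
                       # with sentence length, so in practice you'd probably
                       # length-normalise the score before thresholding

    def keep(source, target, score, threshold=THRESHOLD):
        # The score was computed from the source side only, so this check
        # says nothing about whether the target is a correct translation:
        # a misaligned or third-language target can still pass, and a hard
        # but perfectly good pair can be thrown away.
        return score >= threshold

    for line in sys.stdin:
        source, target, score = line.rstrip("\n").split("\t")
        if keep(source, target, float(score)):
            print(source + "\t" + target)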
