Hi,

We are currently developing an MT system from parallel data containing some sentence pairs that are repeated very often; one of them appears over 16000 times. A certain amount of repetition of identical sentence pairs probably makes sense, since repeated pairs generally indicate reliable translations and should thus be assigned higher probabilities. But there is presumably a threshold above which adding even more occurrences is no longer meaningful.
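For concreteness, here is a minimal Python sketch of the kind of filtering I have in mind: a hard cap on the number of copies of each pair, optionally combined with a proportionate (e.g. square-root) reduction. The function name, the cap value, and the scaling function are just placeholders, not something we have settled on:

```python
from collections import Counter
import math

def cap_duplicates(pairs, cap=100, scale=None):
    """Limit how often each identical sentence pair is kept.

    pairs: iterable of (source, target) tuples.
    cap:   hard maximum number of copies kept per distinct pair.
    scale: optional function mapping the original count to a reduced
           count (e.g. math.sqrt for a proportionate reduction);
           every pair is kept at least once.
    """
    counts = Counter(pairs)
    filtered = []
    for pair, n in counts.items():
        keep = n if scale is None else max(1, int(scale(n)))
        filtered.extend([pair] * min(keep, cap))
    return filtered

# Toy corpus: one pair repeated 16000 times, one rare pair.
corpus = [("hallo", "hello")] * 16000 + [("danke", "thanks")] * 3
reduced = cap_duplicates(corpus, cap=100, scale=math.sqrt)
```

With these settings the very frequent pair is first scaled down to sqrt(16000) ≈ 126 copies and then capped at 100, while the rare pair survives with a single copy.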
Does anyone have advice on how we could establish such a threshold? Would a proportionate reduction work best? Has anyone experimented with filtering corpora in this way? Or is reducing the occurrences of very frequent sentence pairs a bad idea after all?

Looking forward to your replies,
Micha

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
