Hi,

We are currently developing an MT system from parallel data in which
some sentence pairs are repeated very often; one of them appears over
16,000 times. Some repetition of identical sentence pairs probably
makes sense, as they generally indicate reliable translations and
should thus be assigned higher probabilities, but there is presumably
a threshold above which additional occurrences are no longer
meaningful.

Does anyone have advice on how we could establish such a threshold?
Would a proportionate reduction work best? Has anyone experimented
with filtering corpora in this way? Or is reducing the occurrences of
very frequent sentence pairs a bad idea after all?
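To make the question concrete, here is a minimal sketch of the two
options mentioned above: a hard cap on the number of copies of each
pair, and a proportionate (here, logarithmic) reduction for pairs
seen more often than the cap. The function name, the cap value, and
the log-based formula are all illustrative assumptions, not an
established recipe:

```python
import math
from collections import Counter

def cap_counts(pairs, cap=100, mode="hard"):
    """Downsample repeated sentence pairs in a parallel corpus.

    pairs: iterable of (source, target) sentence tuples.
    mode="hard": keep at most `cap` copies of each distinct pair.
    mode="log":  for a pair seen n > cap times, keep roughly
                 cap * (1 + log(n / cap)) copies -- one possible
                 "proportionate reduction" (an assumption, not a
                 standard method).
    """
    counts = Counter(pairs)
    out = []
    for pair, n in counts.items():
        if n <= cap:
            keep = n
        elif mode == "hard":
            keep = cap
        else:  # "log": grows slowly with n instead of truncating
            keep = int(cap * (1 + math.log(n / cap)))
        out.extend([pair] * keep)
    return out
```

With cap=100, a pair repeated 16,000 times would be reduced to 100
copies under the hard cap, or to a few hundred under the log scheme,
so very frequent pairs still get extra probability mass without
dominating the counts.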

Looking forward to your replies,
Micha
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
