[Moses-support] Use high-quality corpus for training or tuning?

Dingyuan Wang Wed, 24 Jun 2015 09:59:07 -0700

Dear all,


I have collected a lot of parallel texts. A large number of them come from
web pages and were aligned automatically by rules and algorithms. Some of
these are missing many sentences on one side (5:1), so the automatic
alignment contains lots of errors. Some of them are well aligned at the
paragraph level. A few of them are mostly single articles that were aligned
by hand or came already aligned.
Since the amount of data is not large (less than a hundred MB), I must
use it efficiently.
In any case, I will manually check the test set line by line.
Should I prefer the high-quality data for tuning, and why?
(I am actually seeking an explanation to convince myself to do so.)

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
