Dear all,
I have collected a lot of parallel texts. A large number of them come from web pages and were aligned by rules and algorithms; some of these are missing many sentences on one side (5:1), so the automatic alignment contains lots of errors. Some of them are well aligned at the paragraph level. A few of them are mostly single articles that were aligned by hand or came already aligned. Since the amount of data is not large (less than a hundred MB), I must use it efficiently. In all cases I will manually check the test set line by line. Should I prefer the high-quality data for tuning, and why? (I am actually looking for an explanation to convince myself to do so.)
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
