Hi guys,
Thanks for the suggestions. I have downloaded the EuroMatrixPlus corpora and extracted the English and Russian text to the text folder using extract.py. Initially I just took all the file pairs whose line counts matched, but that only gives me a corpus of around 500,000 lines. I noticed that most of the files don't have matching line counts, and many contain text that their counterpart does not contain, e.g. A_53_647_CORR1_(en)(ru). What would be the steps involved in getting these files strictly aligned?

Thanks,
Ben
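
P.S. The line-count filter described above can be sketched roughly as follows. This is only a minimal illustration, not the exact script used: the folder layout (one directory per language, with the same file name on both sides) and the function names are assumptions made for the example.

```python
from pathlib import Path


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def matching_pairs(en_dir, ru_dir):
    """Yield (en_file, ru_file) pairs whose line counts are equal.

    Assumes (hypothetically) that both directories use the same file names.
    """
    en_dir, ru_dir = Path(en_dir), Path(ru_dir)
    for en_file in sorted(en_dir.iterdir()):
        ru_file = ru_dir / en_file.name
        # Skip anything that has no counterpart on the Russian side.
        if not en_file.is_file() or not ru_file.is_file():
            continue
        if count_lines(en_file) == count_lines(ru_file):
            yield en_file, ru_file


def build_corpus(en_dir, ru_dir, out_en, out_ru):
    """Concatenate the matching pairs into two parallel files.

    Returns the number of line pairs written.
    """
    total = 0
    with open(out_en, "w", encoding="utf-8") as en_out, \
            open(out_ru, "w", encoding="utf-8") as ru_out:
        for en_file, ru_file in matching_pairs(en_dir, ru_dir):
            with open(en_file, encoding="utf-8", errors="replace") as en_in, \
                    open(ru_file, encoding="utf-8", errors="replace") as ru_in:
                for en_line, ru_line in zip(en_in, ru_in):
                    # Normalise the line ending so both outputs stay in step.
                    en_out.write(en_line.rstrip("\n") + "\n")
                    ru_out.write(ru_line.rstrip("\n") + "\n")
                    total += 1
    return total
```

Equal line counts only make a pair a candidate. They do not by themselves guarantee that line N in one file is a translation of line N in the other, which is why the files that fail this check still need a proper alignment step.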
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
