Hi guys,

Thanks for the suggestions. I have downloaded the EuroMatrixPlus corpora
and extracted the English and Russian text to the text folder using
extract.py. Initially I just took all the file pairs whose line counts
matched, but that only gives me a corpus of around 500,000 lines. I
noticed that most of the file pairs don't have matching line counts, and
many contain text that their counterpart does not, e.g.
A_53_647_CORR1_(en)(ru)
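
A minimal sketch of that kind of line-count filter, in case it helps to
see what I mean. It assumes the extracted files sit in two hypothetical
folders, text/en and text/ru, with identical file names in each (the
real extract.py output layout may differ), and count_lines and
split_by_line_count are illustrative names, not part of the corpus
tools:

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file (undecodable bytes are replaced)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def split_by_line_count(en_dir, ru_dir):
    """Split file names found in both folders into (matched, mismatched).

    Assumption: a file and its translation share the same name, one per
    folder. Files without a counterpart are skipped.
    """
    en_dir, ru_dir = Path(en_dir), Path(ru_dir)
    matched, mismatched = [], []
    for en_file in sorted(en_dir.iterdir()):
        ru_file = ru_dir / en_file.name
        if not en_file.is_file() or not ru_file.is_file():
            continue  # no counterpart on the Russian side
        if count_lines(en_file) == count_lines(ru_file):
            matched.append(en_file.name)
        else:
            mismatched.append(en_file.name)
    return matched, mismatched
```

Equal line counts only show that the two files have the same number of
lines, not that line N in one is a translation of line N in the other,
so even the "matched" list is not guaranteed to be aligned.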

What steps would be involved in getting these files strictly
aligned?

Thanks

Ben

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
