Mark Fishel <fis...@...> writes:
>
> Dear readers,
>
> we keep getting strange, unexpected and sometimes illogical results in
> more than one series of SMT experiments using the JRC Acquis parallel
> corpus. Often the same methods work fine on Europarl. Our question is
Hi Mark,

We have been using the JRC Acquis corpus *extensively*, and I can assure you that we have had no major problems with it. Some colleagues who used the program that comes with the corpus did have some slight problems. I chose to unzip the several volumes manually and never ran into them.

For this corpus, as for others, some characters can derail the training. We have developed Moses for Mere Mortals (http://code.google.com/p/moses-for-mere-mortals/), which provides a Windows add-in (Extract_TMX_Corpus) that helps clean up such things and creates corpora that you can feed directly to Moses (UTF-8, Linux newlines, removal of control characters and so on).

So I can assure you that the JRC Acquis definitely works. It seems to me that the Moses team has already published data about their experiments with this corpus. It covers most, if not all, of the language pairs of the European Union, which is a plus.

Greetings,
João

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
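
For readers who want to try the kind of cleanup mentioned above (UTF-8, Linux newlines, removal of control characters) without the Windows add-in, here is a minimal sketch in Python. It is not the Extract_TMX_Corpus implementation; the function name `clean_corpus_bytes` is hypothetical, and it assumes the input is meant to be UTF-8 and that tabs can safely become spaces.

```python
import unicodedata


def clean_corpus_bytes(raw: bytes) -> str:
    """Return corpus text as UTF-8-decoded str with Linux newlines and no control characters.

    Hypothetical helper for illustration only; not part of Moses or
    Moses for Mere Mortals.
    """
    # Assumption: the file is meant to be UTF-8. Invalid byte sequences
    # become U+FFFD instead of raising an exception.
    text = raw.decode("utf-8", errors="replace")

    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    for ch in text:
        if ch == "\n":
            # Keep newlines so the line count is unchanged.
            cleaned.append(ch)
        elif ch == "\t":
            # Assumption: a tab is better kept as a word separator.
            cleaned.append(" ")
        elif unicodedata.category(ch) == "Cc":
            # Drop every other control character (NUL, form feed, ...).
            continue
        else:
            cleaned.append(ch)
    return "".join(cleaned)
```

Because newlines are normalized rather than removed, the number of lines stays the same after newline normalization, so applying the function to both sides of a parallel corpus should keep the sentence alignment intact, provided both sides were aligned line by line to begin with.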
