Mark Fishel <fis...@...> writes:

> 
> Dear readers,
> 
> we keep getting strange, unexpected and sometimes illogical results in
> more than one series of SMT experiments using the JRC Acquis parallel
> corpus. Often the same methods work fine on Europarl. Our question is

Hi Mark,

We have been using the JRC Acquis corpus *extensively* and I can assure you that
we have had no major problems. Some colleagues who used the program that comes
with the corpus did have some slight problems. I chose to unzip the several
volumes manually and never had them. For this corpus, as for others, some
characters can derail the training. We have developed Moses for Mere Mortals
(http://code.google.com/p/moses-for-mere-mortals/), which provides a Windows
add-in (Extract_TMX_Corpus) that helps to clean up such things and creates
corpora that you can feed directly to Moses (UTF-8, Linux newlines, removal of
control characters and so on). So I can assure you that the JRC Acquis
definitely works. It seems to me that the Moses team has already published data
about their experiments with this corpus. It covers most, if not all, of the
language pairs of the European Union, which is a plus.
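
For readers who want to try this kind of cleaning by hand, here is a minimal
sketch in Python of the three steps mentioned above (UTF-8 output, Linux
newlines, removal of control characters). It is not the code of
Extract_TMX_Corpus or of Moses for Mere Mortals; the function names
clean_line, clean_parallel and write_corpus are hypothetical, and it assumes
the corpus is already available as two aligned lists of segments.

```python
import unicodedata


def clean_line(line: str) -> str:
    """Remove control characters from one segment and tidy its whitespace."""
    # Turn line-break characters and tabs into plain spaces first,
    # so that removing them does not glue neighbouring words together.
    line = line.replace("\r", " ").replace("\n", " ").replace("\t", " ")
    # Drop every remaining character in Unicode category "Cc" (control).
    kept = "".join(ch for ch in line if unicodedata.category(ch) != "Cc")
    # Collapse runs of whitespace and trim both ends.
    return " ".join(kept.split())


def clean_parallel(src_lines, tgt_lines):
    """Clean two aligned lists of segments.

    A pair is dropped when either side is empty after cleaning, so the
    two sides stay aligned line by line.
    """
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = clean_line(src), clean_line(tgt)
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def write_corpus(pairs, src_path, tgt_path):
    """Write the cleaned pairs as UTF-8 with Linux ("\\n") line endings."""
    # newline="\n" stops Python from writing "\r\n" on Windows.
    with open(src_path, "w", encoding="utf-8", newline="\n") as fs, \
         open(tgt_path, "w", encoding="utf-8", newline="\n") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
```

Dropping a pair when one side is empty is a design choice of this sketch:
it keeps the two files the same length, which the training needs, at the
cost of losing a few segments.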

Greetings,

João

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
