I would like to go from the Europarl v6 source tarball
(http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
parallel corpus for any given language pair.

After extracting the contents of the tarball, there is one directory per
language under the txt directory (e.g., txt/da). There are also three
scripts: sentence-align-corpus.perl, tools/split-sentences.perl, and
tools/de-xml.perl.

I assume the order of running these scripts is as follows. If someone more
familiar with the Europarl release process could confirm or correct this,
that would be appreciated.

* Download europarl.tgz

* Untar europarl.tgz

* Run ./sentence-align-source-corpus.perl ${src} ${tgt}

* Run cat aligned/${src}/*.txt | ./tools/split-sentences.perl -l ${src} >
corpus.${src}

* Run cat aligned/${tgt}/*.txt | ./tools/split-sentences.perl -l ${tgt} >
corpus.${tgt}

* Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
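
To make the proposed order concrete, here is a minimal shell sketch of the
steps above. It is only my guess at the process: the script names, their
arguments, and the aligned/ directory layout are copied from the list and
are unconfirmed. The run and split_lang helpers and the DRY_RUN variable are
hypothetical names introduced for illustration. Two assumptions differ from
a literal reading of the list: each language's files are concatenated with
cat (a single < redirection cannot read from several files), and each call
to split-sentences.perl gets the code of the language it is processing.

```shell
#!/bin/sh
# Sketch of the proposed Europarl pipeline. Script names, arguments, and the
# aligned/ layout are taken from the message and are unconfirmed assumptions.

# run (hypothetical helper): print the command when DRY_RUN=1, else execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# split_lang (hypothetical helper): sentence-split all aligned files of one
# language into corpus.<lang>. Handled separately from run because it needs
# a pipe and an output redirection.
split_lang() {
    lang=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cat aligned/$lang/*.txt | ./tools/split-sentences.perl -l $lang > corpus.$lang"
    else
        cat aligned/"$lang"/*.txt | ./tools/split-sentences.perl -l "$lang" > "corpus.$lang"
    fi
}

# europarl_pipeline SRC TGT: run the steps in the order guessed above.
europarl_pipeline() {
    src=$1
    tgt=$2
    run wget http://www.statmt.org/europarl/v6/europarl.tgz
    run tar xzf europarl.tgz
    run ./sentence-align-source-corpus.perl "$src" "$tgt"
    split_lang "$src"
    split_lang "$tgt"
    run ./tools/de-xml.perl corpus "$src" "$tgt" corpus.noxml
}
```

After sourcing the file, setting DRY_RUN=1 and calling europarl_pipeline da en
prints the six commands without downloading or running anything, which makes
it easy to compare against the actual release process.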


Is this the recommended process for deriving a sentence-aligned parallel
corpus for a language pair from the Europarl source release?

Thanks,
Lane Schwartz
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support