Re: [Moses-support] Europarl source release and processing scripts

Tom Hoar Wed, 16 May 2012 03:54:53 -0700


Interesting question, Lane. I would have expected an almost opposite
order, although I haven't looked at the commands for a while and using
them is on my list of things to do. Can someone who has used them shed
some light?


* Download europarl.tgz
* Untar europarl.tgz
* Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
* Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
* Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
* Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
* Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
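
To make the order above easier to compare with Lane's, here is a rough sketch that only prints the commands for a language pair instead of running them. The function name plan_pipeline and the da/en pair are made up for illustration, the file names are the same placeholders as in the list, and passing each file's own language code to -l is my assumption. I have not checked any of these invocations against the actual scripts.

```shell
#!/bin/sh
# Dry-run sketch: prints the proposed command order, runs nothing.
# Assumptions (not verified against the Europarl release):
#   - input files are named ${lang}.txt, as in the placeholders above
#   - split-sentences.perl takes the language of the file it reads via -l
#   - the aligner script name and arguments are copied from this thread

plan_pipeline() {
    src=$1
    tgt=$2

    # Step 1: strip XML markup from both sides.
    for lang in "$src" "$tgt"; do
        echo "./tools/de-xml.perl < ${lang}.txt > ${lang}.noxml.txt"
    done

    # Step 2: split each side into sentences, using that side's language.
    for lang in "$src" "$tgt"; do
        echo "./tools/split-sentences.perl -l ${lang} < ${lang}.noxml.txt > ${lang}.noxml.split.txt"
    done

    # Step 3: sentence-align the two split files.
    echo "./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt"
}

# Example: print the plan for a hypothetical Danish-English pair.
plan_pipeline da en
```

Printing the plan first makes it cheap to review the order on the list before anyone runs it against the real data.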

On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz wrote:

I'm interested in being able to go from the Europarl v6 source tarball
(http://www.statmt.org/europarl/v6/europarl.tgz [1]) to a
sentence-aligned parallel corpus for any given language pair.

After extracting the contents of the tarball, there is one directory
per language stored in a txt dir (ex: txt/da). There are also three
scripts: sentence-align-corpus.perl, tools/split-sentences.perl, and
tools/de-xml.perl.

I assume the order of running these scripts is as
follows. If someone more familiar with the Europarl release process
could confirm or correct this, that would be appreciated.

* Download europarl.tgz
* Untar europarl.tgz
* Run ./sentence-align-source-corpus.perl ${src} ${tgt}
* Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
* Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt > corpus.${tgt}
* Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml

Is this the recommended process for deriving
a sentence-aligned parallel corpus for a language pair from the Europarl
source release?

Thanks,
Lane Schwartz

 

Links:
------
[1] http://www.statmt.org/europarl/v6/europarl.tgz

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
