Interesting question, Lane. I would have expected almost the opposite
order, although I haven't looked at the commands for a while and using
them is still on my list of things to do. Can someone who has used them
shed some light? The order I would have expected:
* Download europarl.tgz
* Untar europarl.tgz
* Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
* Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
* Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
* Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
* Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
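To make the proposed order easier to check, here is a minimal sketch that only prints the commands for one language pair instead of running them. The script names, the -l flag and the ${lang}.txt file layout are copied from the steps above and are not verified against the Europarl release; print_pipeline and the da/en pair are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Dry run: print the steps in the order suggested above (strip XML,
# split sentences, then align). Nothing is executed against the corpus.
# Assumption: each language lives in ${lang}.txt, as written in the list.
print_pipeline() {
    src=$1
    tgt=$2
    # Step 1: remove XML markup from both sides.
    for lang in "$src" "$tgt"; do
        echo "./tools/de-xml.perl < ${lang}.txt > ${lang}.noxml.txt"
    done
    # Step 2: split sentences, passing each side its own language code.
    for lang in "$src" "$tgt"; do
        echo "./tools/split-sentences.perl -l ${lang} < ${lang}.noxml.txt > ${lang}.noxml.split.txt"
    done
    # Step 3: align the two sentence-split files.
    echo "./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt"
}

print_pipeline da en
```

If the printed commands look right for your copy of the tools, the output could be piped to sh from the directory where the tarball was extracted.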
On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz wrote:
I'm interested in being able to go from the Europarl v6 source tarball
(http://www.statmt.org/europarl/v6/europarl.tgz [1]) to a
sentence-aligned parallel corpus for any given language pair.
After extracting the contents of the tarball, there is one directory
per language stored in a txt dir (ex: txt/da). There are also three
scripts: sentence-align-corpus.perl, tools/split-sentences.perl, and
tools/de-xml.perl.
I assume the order of running these scripts is as follows. If someone
more familiar with the Europarl release process could confirm or
correct this, that would be appreciated.
* Download europarl.tgz
* Untar europarl.tgz
* Run ./sentence-align-source-corpus.perl ${src} ${tgt}
* Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
* Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt > corpus.${tgt}
* Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
Is this the recommended process for deriving a sentence-aligned
parallel corpus for a language pair from the Europarl source release?
Thanks,
Lane Schwartz
Links:
------
[1] http://www.statmt.org/europarl/v6/europarl.tgz