Hi, yes, that's correct.
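
As a minimal sketch, the order confirmed here can be written as a dry-run shell function. It only prints the commands and runs nothing. The script names and arguments are copied as-is from Lane's quoted message below and have not been checked against the Europarl release; the function name europarl_steps and the da/en language pair are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: print the pipeline steps in the confirmed order.
# Nothing is executed; command lines are copied from the quoted message.
# europarl_steps is a hypothetical helper name, not part of the release.
europarl_steps() {
    src=$1
    tgt=$2
    # 1. Sentence-align first, while the XML tags are still present,
    #    because the tags help the aligner.
    echo "./sentence-align-source-corpus.perl $src $tgt"
    # 2. Split sentences on each side of the aligned output.
    echo "./tools/split-sentences.perl -l $src < aligned/$src/*.txt > corpus.$src"
    echo "./tools/split-sentences.perl -l $src < aligned/$tgt/*.txt > corpus.$tgt"
    # 3. Strip the XML tags last, after alignment.
    echo "./tools/de-xml.perl corpus $src $tgt corpus.noxml"
}

# Example: list the steps for a Danish-English pair (assumed pair).
europarl_steps da en
```

The second split-sentences call passes -l ${src} exactly as in the quoted message; whether the target side should get -l ${tgt} instead is left open here.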
-phi

On Wed, May 16, 2012 at 7:57 AM, Lane Schwartz <[email protected]> wrote:
> Philipp,
>
> So was my guess about script order correct, then?
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
> Thanks,
> Lane
>
>
> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn <[email protected]> wrote:
>> Hi,
>>
>> The XML tags help with the alignment, so they need to be stripped after
>> sentence alignment, not before it.
>>
>> -phi
>>
>> On 16 May 2012 03:58, "Tom Hoar" <[email protected]> wrote:
>>>
>>> Interesting question, Lane. I would have expected an almost opposite
>>> order, although I haven't looked at the commands for a while and using
>>> them is on my list of things to do. Can someone who has used them shed
>>> some light?
>>>
>>> * Download europarl.tgz
>>> * Untar europarl.tgz
>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
>>>
>>>
>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]> wrote:
>>>
>>> I'm interested in being able to go from the Europarl v6 source tarball
>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>> parallel corpus for any given language pair.
>>>
>>> After extracting the contents of the tarball, there is one directory per
>>> language stored in a txt dir (ex: txt/da). There is
>>> sentence-align-corpus-perl, tools/split-sentences.perl, and
>>> tools/de-xml.perl.
>>>
>>> I assume the order of running these scripts is as follows. If someone
>>> more familiar with the Europarl release process could confirm or correct
>>> this, that would be appreciated.
>>>
>>> * Download europarl.tgz
>>>
>>> * Untar europarl.tgz
>>>
>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>>>
>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>
>>>
>>> Is this the recommended process for deriving a sentence-aligned parallel
>>> corpus for a language pair from the Europarl source release?
>>>
>>> Thanks,
>>> Lane Schwartz
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"
