Hi,

The XML tags help with the alignment, so they need to be stripped after sentence alignment, not before it.
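To illustrate why the order matters, here is a toy Python sketch. It is not what the Europarl alignment script actually does. It only assumes that both language files carry the same markup lines (the `<CHAPTER ID=1>` and `<SPEAKER ID=1>` tags in the example are illustrative) and uses them as anchors to pair up text blocks before the tags are thrown away. The function names are made up for this example.

```python
import re

# A line that consists of a single XML-style tag, e.g. <CHAPTER ID=1>.
TAG = re.compile(r"^<[^>]+>$")


def split_on_tags(lines):
    """Group lines into (tag, text_lines) blocks; each tag line opens a new block."""
    blocks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TAG.match(line):
            blocks.append((line, []))
        elif blocks:
            blocks[-1][1].append(line)
        else:
            # Text before the first tag goes into an untagged block.
            blocks.append((None, [line]))
    return blocks


def align_then_strip(src_lines, tgt_lines):
    """Pair blocks that share a tag in both files, then drop the tags.

    Assumes each tag occurs at most once per file (hypothetical simplification).
    """
    tgt_blocks = dict(split_on_tags(tgt_lines))
    pairs = []
    for tag, src_text in split_on_tags(src_lines):
        if tag in tgt_blocks:
            pairs.append((" ".join(src_text), " ".join(tgt_blocks[tag])))
    return pairs


if __name__ == "__main__":
    src = ["<CHAPTER ID=1>", "Resumption of the session",
           "<SPEAKER ID=1>", "I declare the session resumed."]
    tgt = ["<CHAPTER ID=1>", "Wiederaufnahme der Sitzungsperiode",
           "<SPEAKER ID=2>", "Nur im Zieltext vorhanden.",
           "<SPEAKER ID=1>", "Ich erklaere die Sitzung fuer wiederaufgenommen."]
    for s, t in align_then_strip(src, tgt):
        print(s, "|||", t)
```

If the tags were removed first, the extra target-side block in the example would have nothing to mark it as unmatched, and a purely positional pairing would drift from that point on.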
-phi

On 16 May 2012 03:58, "Tom Hoar" <[email protected]> wrote:

> Interesting question, Lane. I would have expected an almost opposite
> order, although I haven't looked at the commands for a while and using them
> is on my list of things to do. Can someone who has used them shed some
> light?
>
> * Download europarl.tgz
> * Untar europarl.tgz
> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
> * Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
>
> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]> wrote:
>
> I'm interested in being able to go from the Europarl v6 source tarball
> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
> parallel corpus for any given language pair.
>
> After extracting the contents of the tarball, there is one directory per
> language stored in a txt dir (ex: txt/da). There is
> sentence-align-corpus.perl, tools/split-sentences.perl, and
> tools/de-xml.perl.
>
> I assume the order of running these scripts is as follows. If someone more
> familiar with the Europarl release process could confirm or correct this,
> that would be appreciated.
>
> * Download europarl.tgz
> * Untar europarl.tgz
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt > corpus.${tgt}
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
> Is this the recommended process for deriving a sentence-aligned parallel
> corpus for a language pair from the Europarl source release?
>
> Thanks,
> Lane Schwartz
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
