Thanks!

On Wed, May 16, 2012 at 11:14 AM, Philipp Koehn <[email protected]> wrote:
> Hi,
>
> yes, that's correct.
>
> -phi
>
> On Wed, May 16, 2012 at 7:57 AM, Lane Schwartz <[email protected]> wrote:
>> Philipp,
>>
>> So was my guess about script order correct, then?
>>
>> * Download europarl.tgz
>> * Untar europarl.tgz
>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>
>> Thanks,
>> Lane
>>
>> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn <[email protected]> wrote:
>>> Hi,
>>>
>>> The XML tags help with the alignment, so they need to be stripped after
>>> sentence alignment, not before it.
>>>
>>> -phi
>>>
>>> On 16 May 2012 03:58, "Tom Hoar" <[email protected]> wrote:
>>>>
>>>> Interesting question, Lane. I would have expected an almost opposite
>>>> order, although I haven't looked at the commands for a while and using them
>>>> is on my list of things to do. Can someone who has used them shed some
>>>> light?
>>>>
>>>> * Download europarl.tgz
>>>> * Untar europarl.tgz
>>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
>>>> * Run ./tools/split-sentences.perl -l ${src} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
>>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
>>>>
>>>>
>>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]> wrote:
>>>>
>>>> I'm interested in being able to go from the Europarl v6 source tarball
>>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>>> parallel corpus for any given language pair.
>>>>
>>>> After extracting the contents of the tarball, there is one directory per
>>>> language stored in a txt dir (ex: txt/da). There is
>>>> sentence-align-corpus-perl, tools/split-sentences.perl, and
>>>> tools/de-xml.perl.
>>>>
>>>> I assume the order of running these scripts is as follows. If someone more
>>>> familiar with the Europarl release process could confirm or correct this,
>>>> that would be appreciated.
>>>>
>>>> * Download europarl.tgz
>>>> * Untar europarl.tgz
>>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>>
>>>> Is this the recommended process for deriving a sentence-aligned parallel
>>>> corpus for a language pair from the Europarl source release?
>>>>
>>>> Thanks,
>>>> Lane Schwartz
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away. It is time to go elsewhere. The best thing about space travel
>> is that it made it possible to go elsewhere.
>> -- R.A. Heinlein, "Time Enough For Love"
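For anyone who finds this thread later, the order confirmed here (align first, then split sentences, strip XML last) can be sketched as a dry-run shell function. It only prints the commands and does not run them. The script names and arguments are copied from the messages in this thread and are not checked against the Europarl release itself. The function name europarl_plan is hypothetical, and passing the target language to -l on the second split-sentences line is an assumption: the thread passes ${src} on both lines.

```shell
#!/bin/sh
# Dry-run sketch: print the commands in the order confirmed in this thread.
# Nothing is executed. europarl_plan is a hypothetical helper name.
europarl_plan() {
    src=$1
    tgt=$2

    # 1. Sentence-align first, while the XML tags are still in the text
    #    (the tags help the aligner, per Philipp's reply).
    printf '%s\n' "./sentence-align-source-corpus.perl $src $tgt"

    # 2. Split sentences on each side of the aligned output.
    #    Assumption: -l takes the language of the file being split, so the
    #    target side uses $tgt here; the thread shows $src on both lines.
    printf '%s\n' "./tools/split-sentences.perl -l $src < aligned/$src/*.txt > corpus.$src"
    printf '%s\n' "./tools/split-sentences.perl -l $tgt < aligned/$tgt/*.txt > corpus.$tgt"

    # 3. Strip the XML tags last, after alignment has used them.
    printf '%s\n' "./tools/de-xml.perl corpus $src $tgt corpus.noxml"
}

# Example: print the plan for German-English.
europarl_plan de en
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines would still need checking against the actual scripts in the tarball before use.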
