So I finally got around to trying the Europarl processing scripts, using the Europarl v7 data since it is now available. I've encountered an oddity in the behavior of sentence-align-corpus.perl with respect to its calls to tools/split-sentences.perl.
The problem is that in some cases, less sentence splitting appears to occur when tools/split-sentences.perl is called from sentence-align-corpus.perl than when it is called on its own.

I noticed this problem in the French text of ep-99-12-17.txt. If you take a look at line 9 of that file, there are 3 sentences on the one line. After running sentence-align-corpus.perl fr en, I took a look at aligned/fr-en/fr/ep-99-12-17.txt. The first two sentences of the old line 9 are still together, but the third sentence has been broken off onto its own line. However, if I run tools/split-sentences.perl -l fr < txt/fr/ep-99-12-17.txt, all three sentences from line 9 are split onto their own lines.

I've tried debugging this, to no avail. Does anyone have any idea why this might be?

Thanks,
Lane

On Wed, May 16, 2012 at 10:57 AM, Lane Schwartz <[email protected]> wrote:
> Philipp,
>
> So was my guess about script order correct, then?
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
> Thanks,
> Lane
>
>
> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn <[email protected]> wrote:
>> Hi,
>>
>> The XML tags help with the alignment, so they need to be stripped after
>> sentence alignment, not before it.
>>
>> -phi
>>
>> On 16 May 2012 03:58, "Tom Hoar" <[email protected]>
>> wrote:
>>>
>>> Interesting question, Lane. I would have expected an almost opposite
>>> order, although I haven't looked at the commands for a while and using them
>>> is on my list of things to do. Can someone who has used them shed some
>>> light?
>>>
>>> * Download europarl.tgz
>>> * Untar europarl.tgz
>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
>>>
>>>
>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]>
>>> wrote:
>>>
>>> I'm interested in being able to go from the Europarl v6 source tarball
>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>> parallel corpus for any given language pair.
>>>
>>> After extracting the contents of the tarball, there is one directory per
>>> language stored in a txt dir (ex: txt/da). There is
>>> sentence-align-corpus-perl, tools/split-sentences.perl, and
>>> tools/de-xml.perl.
>>>
>>> I assume the order of running these scripts is as follows. If someone more
>>> familiar with the Europarl release process could confirm or correct this,
>>> that would be appreciated.
>>>
>>> * Download europarl.tgz
>>>
>>> * Untar europarl.tgz
>>>
>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>>>
>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>
>>>
>>> Is this the recommended process for deriving a sentence-aligned parallel
>>> corpus for a language pair from the Europarl source release?
>>>
>>> Thanks,
>>> Lane Schwartz
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
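
For anyone who wants to reproduce the comparison described at the top of this message, a small shell helper along the following lines may be useful. It is only a sketch: compare_splits is a hypothetical name, not part of the Europarl tools, and it assumes the standalone splitter output has already been written to a file. The aligned output may also differ from the standalone output for reasons other than sentence splitting, so the diff should be read as a starting point rather than a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Europarl release): compare two
# sentence-split versions of the same document and show where they differ.
compare_splits() {
    standalone=$1   # output of tools/split-sentences.perl run by hand
    aligned=$2      # corresponding file written by sentence-align-corpus.perl

    # Line counts give a quick signal: fewer lines means less splitting.
    echo "standalone: $(wc -l < "$standalone" | tr -d ' ') lines"
    echo "aligned:    $(wc -l < "$aligned" | tr -d ' ') lines"

    # diff exits with status 1 when the files differ; that is the expected
    # case here, so it is not treated as a failure.
    diff "$standalone" "$aligned" || true
}

# Example usage with the file discussed in this message (the /tmp path is
# an arbitrary choice):
#   tools/split-sentences.perl -l fr < txt/fr/ep-99-12-17.txt > /tmp/standalone.fr
#   compare_splits /tmp/standalone.fr aligned/fr-en/fr/ep-99-12-17.txt
```

Lines prefixed with `<` in the diff come from the standalone run and lines prefixed with `>` come from the aligned directory, so a pair of `<` lines replaced by a single `>` line marks a spot where two sentences stayed together.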
