I believe I have solved this mystery. For any given paragraph, after tools/split-sentences.perl has split the text into sentences, the sentence_align function in sentence-align-corpus.perl is free to merge those sentences back together in order to obtain a better sentence alignment with the parallel paragraph in the other language.
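To illustrate how an aligner can end up undoing a split, here is a minimal Python sketch of length-based alignment in the style of Gale and Church. It is not the code in sentence-align-corpus.perl: the bead penalties, the cost function, and the names BEADS, bead_cost, and align are all assumptions made up for this example, and the sentence lengths are invented.

```python
# Illustrative length-based sentence aligner. This is a simplified sketch,
# NOT the cost model used by sentence-align-corpus.perl.

# Allowed alignment "beads": (source sentences consumed, target sentences
# consumed), each with a hypothetical fixed penalty.
BEADS = {(1, 1): 0.0, (2, 1): 2.0, (1, 2): 2.0, (1, 0): 5.0, (0, 1): 5.0}


def bead_cost(src_len, tgt_len, penalty):
    """Cost of pairing src_len characters with tgt_len characters."""
    total = src_len + tgt_len
    mismatch = abs(src_len - tgt_len) / total if total else 0.0
    return 10.0 * mismatch + penalty


def align(src_lens, tgt_lens):
    """Return the cheapest list of (source indices, target indices) beads."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programme over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), penalty)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Trace the best path back from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Three source sentences against two target sentences (made-up lengths).
print(align([40, 60, 50], [95, 48]))
```

With these made-up lengths, the cheapest path pairs source sentences 0 and 1 with target sentence 0, and source sentence 2 with target sentence 1. In other words, the first two source sentences end up merged even though the splitter separated them.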
That is what appears to be happening here. For the paragraph in question, tools/split-sentences.perl splits the French into 3 sentences and the English into 2 sentences. The Church & Gale algorithm that the sentence_align subroutine implements then merges the first and second French sentences back together. Looking at the first English sentence, this is indeed the correct thing to do in this case.

Cheers,
Lane

On Fri, Jul 27, 2012 at 4:52 PM, Lane Schwartz wrote:
> So I finally got around to trying the Europarl processing scripts,
> using the Europarl v7 data since it is now available. I've encountered
> an oddity in the behavior of sentence-align-corpus.perl with respect
> to its calls to tools/split-sentences.perl.
>
> The problem is that in some cases, less sentence splitting appears to
> occur when tools/split-sentences.perl is called from
> sentence-align-corpus.perl than when it is called on its own.
>
> I noticed this problem in the French text of ep-99-12-17.txt. If you
> take a look at line 9 of that file, there are 3 sentences on the one
> line.
>
> After running sentence-align-corpus.perl fr en, I took a look at
> aligned/fr-en/fr/ep-99-12-17.txt. The first two sentences of the old
> line 9 are still together, but the third sentence has been broken off
> onto its own line.
>
> However, if I run
> tools/split-sentences.perl -l fr < txt/fr/ep-99-12-17.txt,
> all three sentences from line 9 are split onto their own lines.
>
> I've tried debugging this, to no avail. Does anyone have any idea why
> this might be?
>
> Thanks,
> Lane
>
> On Wed, May 16, 2012 at 10:57 AM, Lane Schwartz wrote:
>> Philipp,
>>
>> So was my guess about script order correct, then?
>>
>> * Download europarl.tgz
>>
>> * Untar europarl.tgz
>>
>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>
>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>>
>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>
>> Thanks,
>> Lane
>>
>> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn wrote:
>>> Hi,
>>>
>>> The XML tags help with the alignment, so they need to be stripped after
>>> sentence alignment, not before it.
>>>
>>> -phi
>>>
>>> On 16 May 2012 03:58, "Tom Hoar" wrote:
>>>>
>>>> Interesting question, Lane. I would have expected an almost opposite
>>>> order, although I haven't looked at the commands for a while and using them
>>>> is on my list of things to do. Can someone who has used them shed some
>>>> light?
>>>>
>>>> * Download europarl.tgz
>>>> * Untar europarl.tgz
>>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt > ${src}.noxml.split.txt
>>>> * Run ./tools/split-sentences.perl -l ${src} < ${tgt}.noxml.txt > ${tgt}.noxml.split.txt
>>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt ${tgt}.noxml.split.txt
>>>>
>>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz wrote:
>>>>
>>>> I'm interested in being able to go from the Europarl v6 source tarball
>>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>>> parallel corpus for any given language pair.
>>>>
>>>> After extracting the contents of the tarball, there is one directory per
>>>> language stored in a txt dir (ex: txt/da). There is
>>>> sentence-align-corpus.perl, tools/split-sentences.perl, and
>>>> tools/de-xml.perl.
>>>>
>>>> I assume the order of running these scripts is as follows. If someone more
>>>> familiar with the Europarl release process could confirm or correct this,
>>>> that would be appreciated.
>>>>
>>>> * Download europarl.tgz
>>>>
>>>> * Untar europarl.tgz
>>>>
>>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>>
>>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt > corpus.${src}
>>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${tgt}/*.txt > corpus.${tgt}
>>>>
>>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>>
>>>> Is this the recommended process for deriving a sentence-aligned parallel
>>>> corpus for a language pair from the Europarl source release?
>>>>
>>>> Thanks,
>>>> Lane Schwartz
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away. It is time to go elsewhere. The best thing about space travel
>> is that it made it possible to go elsewhere.
>> -- R.A. Heinlein, "Time Enough For Love"
