Hi,

The sentence alignment of these corpora is not perfect, so some sentence pairs contain sentences that do not actually correspond to each other.
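
As a rough illustration of the kind of check that catches such pairs before training, here is a minimal Python sketch. It drops sentence pairs where either side is empty, where a side falls outside a token-count range, or where the two lengths are very lopsided. It is not the Moses clean-corpus-n.perl script; the function name clean_parallel and the thresholds (1 to 80 tokens, length ratio 9) are assumptions chosen for the example.

```python
def clean_parallel(src_lines, tgt_lines, min_len=1, max_len=80, max_ratio=9.0):
    """Return the (source, target) pairs that pass simple sanity checks.

    Hypothetical helper for illustration; thresholds are assumptions.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    if len(src_lines) != len(tgt_lines):
        raise ValueError("corpora have different numbers of lines")

    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        # An empty line has 0 tokens, so it fails the min_len check.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Very unequal lengths often signal a misaligned pair.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src.strip(), tgt.strip()))
    return kept


# Toy input: the second and third source lines are empty.
de = ["Es war Deutschlands glücklichste Nacht.", "", ""]
en = ["It was Germany's happiest hour.", "Germany was reunited;", "In Afghanistan, war continued"]
pairs = clean_parallel(de, en)  # keeps only the first pair
```

Because whole pairs are dropped together, both sides keep the same number of lines, which is what the aligner needs.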
Your example has empty lines - this is something that the script
clean-corpus-n.perl should filter out. It is absolutely essential to
run that script before running GIZA++.

-phi

On Thu, May 17, 2012 at 5:27 PM, Heather Macbeth <[email protected]> wrote:

> Hi Barry and Moses-support,
>
> Thanks again for your detailed suggestions. I believe I've fixed the
> errors I asked about.
>
> With the "sentence mismatch error," the problem was that at that line the
> news-commentary-v7.de-en parallel corpora apparently go out of sync. Is
> this known? Or deliberate?
>
> 41812:Es war Deutschlands glücklichste Nacht.
> 41813:Betrachtet man sich deren Folgen zwanzig Jahre später, so liegen
> revolutionäre Veränderungen hinter uns:
> 41814:
> 41815:
> 41816:Die Sowjetunion und ihr Imperium sind sang- und klanglos
> verschwunden und mit ihr die ganze Weltordnung des Kalten Krieges.
> 41817:Deutschland wurde wiedervereinigt, Osteuropa und die Staaten der
> sowjetischen Peripherie gewannen ihre Unabhängigkeit, das Apartheid-Regime
> in Südafrika löste sich auf, zahlreiche Bürgerkriege in Asien, Afrika und
> Lateinamerika fanden ein Ende, im Nahen Osten kamen Israelis und
> Palästinenser einem Frieden so nah, wie seitdem nicht mehr, und das
> auseinander brechende Jugoslawien versank in Krieg und ethnischer
> Säuberung. In Afghanistan ging der Krieg unter neuen Vorzeichen weiter, und
> das sollte Konsequenzen haben.
> 41818:Die USA waren der siegreiche Erbe der zusammengebrochenen Ordnung
> des Kalten Krieges und standen allein und unangefochten auf dem Gipfel
> ihrer globalen Macht.
>
>
> 41812:It was Germany's happiest hour.
> 41813:Twenty years later, many revolutionary consequences of that night
> lie behind us. The Soviet Union and its empire quietly disappeared, and
> with them the Cold War international order.
> 41814:Germany was reunited; Eastern Europe and the states on the Soviet
> periphery won their independence; South Africa's apartheid regime fell
> apart, numerous civil wars in Asia, Africa, and Latin America ended;
> Israelis and Palestinians came closer to peace than at any time since; and
> a disintegrating Yugoslavia degenerated into war and ethnic cleansing.
> 41815:In Afghanistan, war continued under different circumstances, with
> serious ramifications for the region and, indeed, the world.
> 41816:As the victorious heir to the collapsed Cold War order, the United
> States stood alone, undisputed, at the peak of its global power.
>
>
> (The "Giza did not produce the output file" ERROR I had previously fixed,
> quite along the lines Barry Haddow suggested -- see my post of Thu 17 May,
> 3 am GMT.)
>
> Sincerely,
> Heather Macbeth
>
>
> On Thu, May 17, 2012 at 3:57 AM, Barry Haddow <[email protected]> wrote:
>
>> Hi Heather
>>
>> It all looks quite normal until the sentence mismatch errors start.
>> Although I didn't see this error:
>>
>> > > (line 2385) ERROR: Giza did not produce the output file
>> > > train/giza.de-en/de-en.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>>
>> in your log file. From the log file and directory listings you gave me, it
>> seemed that giza *did* produce its output (ie train/giza.de-en/de-en.A3.final
>> and the equivalent in the other direction). The sentence mismatch errors
>> indicate that the forward and backward giza outputs are not compatible.
>>
>> I wonder if you had some files left over from a previous run?
>>
>> Could you try running train-model.perl from the start on a completely clean
>> directory, and if it fails, post the output? If the first failure is a
>> sentence mismatch error, maybe you could post the de-en.A3.final.gz and
>> en-de.A3.final.gz files,
>>
>> cheers - Barry
>>
>>
>> On Thursday 17 May 2012 04:06:21 Heather Macbeth wrote:
>> > Hi Barry and Moses-support,
>> >
>> > Thanks for getting back to me. In answer to Barry's questions,
>> > * I'm using giza, not mgiza.
>> > * I've put my training.out at
>> > math.princeton.edu/~macbeth/training.out<http://math.princeton.edu/%7Emacbeth/training.out>
>> > * I've listed the files produced by the script at the end of this email.
>> >
>> > Of the two errors I asked about earlier, one remains: In Step 3, a
>> > "sentence mismatch error" on almost every sentence. (Lines
>> > #41604-#158020.)
>> >
>> > I noticed an earlier thread
>> > http://www.mail-archive.com/[email protected]/msg02130.html
>> > in which something similar was reported. There, Felipe Sánchez
>> > Martínez suggested cleaning the corpus as a fix. I had done that. I'd
>> > be very grateful for suggestions of other things to investigate. Or is
>> > this something I shouldn't worry about?
>> >
>> > (I had also asked about another ERROR: in Step 2, the "Giza did not
>> > produce the output file." It didn't recur on re-running, and I think it
>> > may be fixed -- I believe the problem last time might have been not
>> > stripping the working directory of files from previous partial runs.)
>> >
>> > Sincerely,
>> > Heather Macbeth
>> >
>> >
>> > ** Files produced in working directory by Steps 1-3 **
>> >
>> > .:
>> > train
>> > training [this is a script of mine]
>> > training.out
>> >
>> > ./train:
>> > corpus
>> > giza.de-en
>> > giza.en-de
>> > model
>> >
>> > ./train/corpus:
>> > de-en-int-train.snt
>> > de.vcb
>> > de.vcb.classes
>> > de.vcb.classes.cats
>> > en-de-int-train.snt
>> > en.vcb
>> > en.vcb.classes
>> > en.vcb.classes.cats
>> >
>> > ./train/giza.de-en:
>> > de-en.A3.final.gz
>> > de-en.cooc
>> > de-en.gizacfg
>> >
>> > ./train/giza.en-de:
>> > en-de.A3.final.gz
>> > en-de.cooc
>> > en-de.gizacfg
>> >
>> > ./train/model:
>> > aligned.grow-diag-final-and
>> >
>> > On Wed, May 16, 2012 at 3:50 PM, Barry Haddow <[email protected]> wrote:
>> > > Hi Heather
>> > >
>> > > Could you post the training.out file (or at least the step 2 part of it)?
>> > >
>> > > Are you using giza or mgiza?
>> > >
>> > > What files did giza produce?
>> > >
>> > > Cheers - Barry
>> > >
>> > > Sent from my ZX81
>> > >
>> > >
>> > > ----- Reply message -----
>> > > From: "Heather Macbeth" <[email protected]>
>> > > Date: Wed, May 16, 2012 20:04
>> > > Subject: [Moses-support] "Giza did not produce the output file" -- on
>> > > cleaned corpus
>> > > To: <[email protected]>
>> > >
>> > > Hi Moses-support,
>> > >
>> > > I'm looking for help on a problem that arose while building a baseline
>> > > system. Apart from changing FR to DE, I've tried to follow the
>> > > instructions http://www.statmt.org/moses/?n=Moses.Baseline exactly.
>> > >
>> > > When I run the script train-model, the transcript training.out reports
>> > >
>> > > (line 2385) ERROR: Giza did not produce the output file
>> > > train/giza.de-en/de-en.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>> > > (line 3285) ERROR: Giza did not produce the output file
>> > > train/giza.en-de/en-de.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>> > >
>> > > (I had indeed cleaned the corpus as instructed. The output concluded
>> > > with Input sentences: 158840 Output sentences: 158020
>> > > So I take it this step went through ok.)
>> > >
>> > > It seems that people have had this problem before, for instance
>> > > http://www.mail-archive.com/[email protected]/msg03434.html
>> > >
>> > > Barry Haddow's suggestion in that thread was to "have a look at the giza
>> > > log file to see what went wrong. Maybe the merging of alignments failed."
>> > > Does "giza log file" mean the Step 2 part of training.out? If so, I've
>> > > tried this, but I'm not exactly sure what I'm looking for. There are a
>> > > lot of WARNINGS, mainly of the form "already N iterations in hillclimb,"
>> > > but no other errors.
>> > >
>> > > Any suggestions for what symptoms to look for in the giza log file would
>> > > be very welcome.
>> > >
>> > >
>> > > In case it's relevant, let me mention another error that happens later
>> > > (which I assume is a consequence of the first error): during word
>> > > alignments, a "sentence mismatch error" on almost every sentence. Here's
>> > > the relevant part of the transcript: at the beginning of Step 3 (around
>> > > line 5500):
>> > >
>> > > (3) generate word alignment @ Mon May 14 05:17:23 EDT 2012
>> > > Combining forward and inverted alignment from files:
>> > > train/giza.de-en/de-en.A3.final.{bz2,gz}
>> > > train/giza.en-de/en-de.A3.final.{bz2,gz}
>> > > Executing: mkdir -p train/model
>> > > Executing: /home/heather/mosesdecoder/dist/training/symal/giza2bal.pl -d
>> > > "gzip -cd train/giza.en-de/en-de.A3.final.gz" -i "gzip -cd
>> > > train/giza.de-en/de-en.A3.final.gz"
>> > > |/home/heather/mosesdecoder/dist/training/symal/symal -alignment="grow"
>> > > -diagonal="yes" -final="yes" -both="yes" >
>> > > train/model/aligned.grow-diag-final-and
>> > > symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
>> > > Sentence mismatch error! Line #16665
>> > > Sentence mismatch error! Line #16666
>> > > Sentence mismatch error! Line #16667
>> > > Sentence mismatch error! Line #16668
>> > > Sentence mismatch error! Line #16669
>> > > ....
>> > > Sentence mismatch error! Line #158018
>> > > Sentence mismatch error! Line #158019
>> > > Sentence mismatch error! Line #158020
>> > >
>> > >
>> > > Sincerely,
>> > > Heather Macbeth
>> > >
>> > >
>> > >
>> > > The University of Edinburgh is a charitable body, registered in
>> > > Scotland, with registration number SC005336.
>> >
>>
>> --
>> Barry Haddow
>> University of Edinburgh
>> +44 (0) 131 651 3173
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
