Hi,

The sentence alignment of these corpora is not perfect, so some of
the sentence pairs contain sentences that do not actually correspond
to each other.

Your example has empty lines - this is something that the script
clean-corpus-n.perl should filter out. It is absolutely essential to
run that script before running GIZA++.
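
To make the filtering step concrete, here is a minimal Python sketch of
the kind of check involved. It is not clean-corpus-n.perl itself: the
function name clean_pairs and the default limits are hypothetical, and
only the empty-line and length checks are shown. It drops a sentence
pair when either side is empty or has a token count outside the given
bounds, and reports the line numbers that were dropped.

```python
def clean_pairs(src_lines, tgt_lines, min_len=1, max_len=80):
    """Filter two parallel lists of sentences.

    Returns (kept_pairs, dropped_line_numbers). A pair is dropped when
    either side has fewer than min_len or more than max_len
    whitespace-separated tokens; an empty line has zero tokens.
    min_len and max_len are illustrative defaults (assumptions).
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("corpora have different numbers of lines")

    kept, dropped = [], []
    for lineno, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src.strip(), tgt.strip()))
        else:
            dropped.append(lineno)  # 1-based line number in the input
    return kept, dropped
```

Called on the two sides of a corpus read with readlines(), the second
return value lists the pairs that would be removed, such as the empty
German lines in your example.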

-phi

On Thu, May 17, 2012 at 5:27 PM, Heather Macbeth <[email protected]> wrote:

> Hi Barry and Moses-support,
>
> Thanks again for your detailed suggestions.  I believe I've fixed the
> errors I asked about.
>
> With the "sentence mismatch error," the problem was that at that line the
> news-commentary-v7.de-en parallel corpora apparently go out of sync.  Is
> this known?  Or deliberate?
>
> 41812:Es war Deutschlands glücklichste Nacht.
> 41813:Betrachtet man sich deren Folgen zwanzig Jahre später, so liegen
> revolutionäre Veränderungen hinter uns:
> 41814:
> 41815:
> 41816:Die Sowjetunion und ihr Imperium sind sang- und klanglos
> verschwunden und mit ihr die ganze Weltordnung des Kalten Krieges.
> 41817:Deutschland wurde wiedervereinigt, Osteuropa und die Staaten der
> sowjetischen Peripherie gewannen ihre Unabhängigkeit, das Apartheid-Regime
> in Südafrika löste sich auf, zahlreiche Bürgerkriege in Asien, Afrika und
> Lateinamerika fanden ein Ende, im Nahen Osten kamen Israelis und
> Palästinenser einem Frieden so nah, wie seitdem nicht mehr, und das
> auseinander brechende Jugoslawien versank in Krieg und ethnischer
> Säuberung. In Afghanistan ging der Krieg unter neuen Vorzeichen weiter, und
> das sollte Konsequenzen haben.
> 41818:Die USA waren der siegreiche Erbe der zusammengebrochenen Ordnung
> des Kalten Krieges und standen allein und unangefochten auf dem Gipfel
> ihrer globalen Macht.
>
>
> 41812:It was Germany's happiest hour.
> 41813:Twenty years later, many revolutionary consequences of that night
> lie behind us. The Soviet Union and its empire quietly disappeared, and
> with them the Cold War international order.
> 41814:Germany was reunited; Eastern Europe and the states on the Soviet
> periphery won their independence; South Africa's apartheid regime fell
> apart, numerous civil wars in Asia, Africa, and Latin America ended;
> Israelis and Palestinians came closer to peace than at any time since; and
> a disintegrating Yugoslavia degenerated into war and ethnic cleansing.
> 41815:In Afghanistan, war continued under different circumstances, with
> serious ramifications for the region and, indeed, the world.
> 41816:As the victorious heir to the collapsed Cold War order, the United
> States stood alone, undisputed, at the peak of its global power.
>
>
> (The "Giza did not produce the output file" ERROR I had previously fixed,
> quite along the lines Barry Haddow suggested -- see my post of Thu 17 May,
> 3 am GMT.)
>
> Sincerely,
> Heather Macbeth
>
>
> On Thu, May 17, 2012 at 3:57 AM, Barry Haddow <[email protected]> wrote:
>
>> Hi Heather
>>
>> It all looks quite normal until the sentence mismatch errors start,
>> although I didn't see this error:
>>
>> > > (line 2385) ERROR: Giza did not produce the output file
>> > > train/giza.de-en/de-en.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>>
>> in your log file. From the log file and directory listings you gave me, it
>> seemed that giza *did* produce its output (i.e. train/giza.de-en/de-en.A3.final
>> and the equivalent in the other direction). The sentence mismatch errors
>> indicate that the forward and backward giza outputs are not compatible.
>>
>> I wonder if you had some files left over from a previous run?
>>
>> Could you try running train-model.perl from the start in a completely
>> clean directory, and if it fails, post the output? If the first failure is
>> a sentence mismatch error, maybe you could post the de-en.A3.final.gz and
>> en-de.A3.final.gz files?
>>
>> cheers - Barry
>>
>>
>> On Thursday 17 May 2012 04:06:21 Heather Macbeth wrote:
>> > Hi Barry and Moses-support,
>> >
>> > Thanks for getting back to me.  In answer to Barry's questions,
>> > * I'm using giza, not mgiza.
>> > * I've put my training.out at
>> > http://math.princeton.edu/~macbeth/training.out
>> > * I've listed the files produced by the script at the end of this email.
>> >
>> > Of the two errors I asked about earlier, one remains:  In Step 3, a
>> > "sentence mismatch error" on almost every sentence.  (Lines
>> >  #41604-#158020.)
>> >
>> > I noticed an earlier thread
>> > http://www.mail-archive.com/[email protected]/msg02130.html
>> > in which something similar was reported.  There, Felipe Sánchez
>> > Martínez suggested cleaning the corpus as a fix.  I had done that.  I'd
>> > be very grateful for suggestions of other things to investigate.  Or is
>> > this something I shouldn't worry about?
>> >
>> > (I had also asked about another ERROR:  in Step 2, the "Giza did not
>> > produce the output file."  It didn't recur on re-running, and I think it
>> > may be fixed -- I believe the problem last time might have been not
>> > stripping the working directory of files from previous partial runs.)
>> >
>> > Sincerely,
>> > Heather Macbeth
>> >
>> >
>> > ** Files produced in working directory by Steps 1-3 **
>> >
>> > .:
>> > train
>> > training [this is a script of mine]
>> > training.out
>> >
>> > ./train:
>> > corpus
>> > giza.de-en
>> > giza.en-de
>> > model
>> >
>> > ./train/corpus:
>> > de-en-int-train.snt
>> > de.vcb
>> > de.vcb.classes
>> > de.vcb.classes.cats
>> > en-de-int-train.snt
>> > en.vcb
>> > en.vcb.classes
>> > en.vcb.classes.cats
>> >
>> > ./train/giza.de-en:
>> > de-en.A3.final.gz
>> > de-en.cooc
>> > de-en.gizacfg
>> >
>> > ./train/giza.en-de:
>> > en-de.A3.final.gz
>> > en-de.cooc
>> > en-de.gizacfg
>> >
>> > ./train/model:
>> > aligned.grow-diag-final-and
>> >
>> > On Wed, May 16, 2012 at 3:50 PM, Barry Haddow <[email protected]> wrote:
>> > > Hi Heather
>> > >
>> > > Could you post the training.out file (or at least the step 2 part of
>> > > it)?
>> > >
>> > > Are you using giza or mgiza?
>> > >
>> > > What files did giza produce?
>> > >
>> > > Cheers - Barry
>> > >
>> > > Sent from my ZX81
>> > >
>> > >
>> > > ----- Reply message -----
>> > > From: "Heather Macbeth" <[email protected]>
>> > > Date: Wed, May 16, 2012 20:04
>> > > Subject: [Moses-support] "Giza did not produce the output file" -- on
>> > > cleaned corpus
>> > > To: <[email protected]>
>> > >
>> > > Hi Moses-support,
>> > >
>> > > I'm looking for help on a problem that arose while building a baseline
>> > > system.  Apart from changing FR to DE, I've tried to follow the
>> > > instructions http://www.statmt.org/moses/?n=Moses.Baseline exactly.
>> > >
>> > > When I run the script train-model, the transcript training.out reports
>> > >
>> > > (line 2385) ERROR: Giza did not produce the output file
>> > > train/giza.de-en/de-en.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>> > > (line 3285) ERROR: Giza did not produce the output file
>> > > train/giza.en-de/en-de.A3.final. Is your corpus clean (reasonably-sized
>> > > sentences)? at /home/heather/mosesdecoder/dist/training/train-model.perl
>> > > line 1077.
>> > >
>> > > (I had indeed cleaned the corpus as instructed.  The output concluded
>> > > with Input sentences: 158840  Output sentences:  158020
>> > > So I take it this step went through ok.)
>> > >
>> > > It seems that people have had this problem before, for instance
>> > > http://www.mail-archive.com/[email protected]/msg03434.html
>> > >
>> > > Barry Haddow's suggestion in that thread was to "have a look at the giza
>> > > log file to see what went wrong. Maybe the merging of alignments failed."
>> > > Does "giza log file" mean the Step 2 part of training.out?  If so, I've
>> > > tried this, but I'm not exactly sure what I'm looking for.  There are a
>> > > lot of WARNINGS, mainly of the form "already N iterations in hillclimb,"
>> > > but no other errors.
>> > >
>> > > Any suggestions for what symptoms to look for in the giza log file would
>> > > be very welcome.
>> > >
>> > >
>> > > In case it's relevant, let me mention another error that happens later
>> > > (which I assume is a consequence of the first error):  during word
>> > > alignment, a "sentence mismatch error" on almost every sentence.  Here's
>> > > the relevant part of the transcript, at the beginning of Step 3 (around
>> > > line 5500):
>> > >
>> > > (3) generate word alignment @ Mon May 14 05:17:23 EDT 2012
>> > > Combining forward and inverted alignment from files:
>> > >  train/giza.de-en/de-en.A3.final.{bz2,gz}
>> > >  train/giza.en-de/en-de.A3.final.{bz2,gz}
>> > > Executing: mkdir -p train/model
>> > > Executing: /home/heather/mosesdecoder/dist/training/symal/giza2bal.pl -d
>> > > "gzip -cd train/giza.en-de/en-de.A3.final.gz" -i "gzip -cd
>> > > train/giza.de-en/de-en.A3.final.gz"
>> > > |/home/heather/mosesdecoder/dist/training/symal/symal -alignment="grow"
>> > > -diagonal="yes" -final="yes" -both="yes" >
>> > > train/model/aligned.grow-diag-final-and
>> > > symal: computing grow alignment: diagonal (1) final (1)both-uncovered (1)
>> > > Sentence mismatch error! Line #16665
>> > > Sentence mismatch error! Line #16666
>> > > Sentence mismatch error! Line #16667
>> > > Sentence mismatch error! Line #16668
>> > > Sentence mismatch error! Line #16669
>> > > ....
>> > > Sentence mismatch error! Line #158018
>> > > Sentence mismatch error! Line #158019
>> > > Sentence mismatch error! Line #158020
>> > >
>> > >
>> > > Sincerely,
>> > > Heather Macbeth
>> > >
>> > >
>> > >
>> > > The University of Edinburgh is a charitable body, registered in
>> > > Scotland, with registration number SC005336.
>> >
>>
>> --
>> Barry Haddow
>> University of Edinburgh
>> +44 (0) 131 651 3173
>>
>>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
