Hi,

GIZA++ requires not only that no sentence has more than 100 words, but also that the word counts of the two sentences in each pair differ by a ratio of at most 9:1. Otherwise, GIZA++ truncates sentences, which leads to problems further down the processing pipeline.
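
As a rough illustration of those two limits, here is a minimal Python sketch that checks sentence pairs against them. It is not a description of how the cleaning script works internally: the function names are hypothetical, words are counted by whitespace splitting (assuming an already tokenized corpus), and dropping empty sentences is my own assumption.

```python
from typing import Iterable, Iterator, Tuple

MAX_WORDS = 100   # per-sentence word limit mentioned above
MAX_RATIO = 9.0   # maximum longer/shorter word-count ratio mentioned above


def is_giza_safe(src: str, tgt: str,
                 max_words: int = MAX_WORDS,
                 max_ratio: float = MAX_RATIO) -> bool:
    """Return True if the pair respects both limits (hypothetical helper)."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False  # assumption: empty sentences are rejected as well
    if n_src > max_words or n_tgt > max_words:
        return False
    # A ratio of exactly 9:1 is still allowed ("at most 9:1").
    return max(n_src, n_tgt) <= max_ratio * min(n_src, n_tgt)


def keep_safe_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs that pass is_giza_safe."""
    for src, tgt in pairs:
        if is_giza_safe(src, tgt):
            yield src, tgt
```

Running keep_safe_pairs over the two corpus files read in parallel, before training, should show how many pairs fall outside the limits.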

The script clean-corpus-n.perl does all the required data cleaning, and as Hieu suggests, it may be a good idea to use experiment.perl which by default triggers all the necessary steps.

-phi

On Tue, Jun 26, 2012 at 9:07 AM, <[email protected]> wrote:
> Hi,
>
> I am trying to run Moses by following the baseline system for my own data.
> After the step "Training the Translation System," the file training.out is
> created. After tuning and testing, I decided to check the training.out file.
> A segment of it (it's really long) is attached at the bottom of this message.
> I noticed two issues.
>
> First, I keep getting a "[Hill Climb / Model2 / ...]viterbi alignment has
> zero score" warning. What does this warning mean, and how can I rectify this?
> A quick search of the mailing archives revealed that someone solved this
> issue by reducing the size of the maximum sentence length of the corpus, but
> this doesn't explain why the warnings occur.
>
> Second, and the more important of my two questions, my GIZA word alignment
> files do not seem to be of the same length. When I ran the cleaning (the step
> that the baseline system says to do right before the "Language Model
> Training"), I limited sentence lengths to be 100. But after checking my two
> alignment files (foreign -> english and english -> foreign), some sentences
> are missing in the foreign -> english file. I read in the mailing archives
> that someone said cleaning the corpus would solve this issue, but I am sure
> that I ran the cleaning script on the data to limit sentence length to 100
> words before training (i.e running GIZA++ on it). So what else could be the
> issue? Many mismatch errors exist because one missing sentence throws future
> alignments off by 1.
>
> Thanks for any help you can offer. I've put a segment of the "training.out"
> file at the bottom, with [...] indicating that there were many lines that I
> did not copy & paste because of the vast repetition.
>
> Sincerely,
> Daniel Seita
>
>
> [...]
> 0.0114939 j:74 i:8; NP 6.31447e-06 AP1 0.01262 j:75 i:6; NP 3.15751e-06
> AP1 0.0110491 j:76 i:12; NP 3.15751e-06 AP1 0.0106323 j:77 i:11; NP
> 3.15751e-06 AP1 0.535517 j:78 i:12; NP 3.15751e-06 AP1 0.0104985 j:79 i:6;
> NP 0.358595 AP1 0.0117257 j:80 i:4; NP 2.10507e-06 AP1 0.0104082 j:81 i:11;
> NP 0.000320498 AP1 0.011022 j:82 i:6; NP 1.07495e-05 AP1 0.0112897 AP2
> 0.02392 j:83 i:11;
> WARNING: Hill Climbing yielded a zero score viterbi alignment for the
> following pair:
> AL(l:15,m:84)(a: 14 0 14 14 14 14 14 0 0 14 7 14 0 14 6 1 0 5 0 0 5 0 0 15 15
> 15 1 15 6 15 1 15 15 5 15 15 7 11 8 3 2 3 5 4 13 2 2 5 1 2 1 2 2 3 1 1 3 3 1
> 2 11 3 1 5 13 7 5 6 5 2 7 7 7 2 9 7 13 12 13 7 5 12 7 12 )(fert: 9 9 9 6 1 9
> 3 9 1 1 0 2 3 4 9 9 ) c:
> Source sentence length : 15 , target : 84
> 20 169 19 92 5 19 4 20 116 29 75 906 89 33643 3
> 116639 6 1069 213 247 5372 24011 17 19 2319 3 5328 6 1112 21 28 6405 5 2 4332
> 5 24011 17 19 178 7 112 8313 20 1042 106 2 1563 5 37 4189 60 8 2135 29 24011
> 111119 6 47 72 26118 5603 6 2 970 296 68 12 36604 249 538 700 305 4327 2680
> 366 305 3288 5 2680 3 2053 24011 6509 12 24011 10624 3 12 16215 201 249 538
> 700 3288 5 305 4 12
> WARNING: Model2 viterbi alignment has zero score.
> Here are the different elements that made this alignment probability zero
> Source length 15 target length 95
> best: fs[1] 1 : es[1] 1 , a: 0.897485 t: 0.00395663 score 0.00355102
> product : 0.00355102 ss 0
> best: fs[2] 2 : es[10] 10 , a: 0.00624286 t: 0.0126491 score 7.89665e-05
> product : 2.80411e-07 ss 0
> best: fs[3] 3 : es[0] 0 , a: 0.0574229 t: 0.120728 score 0.00693258
> product : 1.94397e-09 ss 0
> best: fs[4] 4 : es[12] 12 , a: 0.00893218 t: 0.987925 score 0.00882433
> product : 1.71542e-11 ss 0
> [...]
> WARNING: Model2 viterbi alignment has zero score.
> Here are the different elements that made this alignment probability zero
> Source length 9 target length 78
> best: fs[1] 1 : es[3] 3 , a: 0.0109157 t: 0.970912 score 0.0105982 product
> : 0.0105982 ss 0
> best: fs[2] 2 : es[2] 2 , a: 0.786017 t: 1e-07 score 7.86017e-08 product :
> 8.33033e-10 ss 0
> best: fs[3] 3 : es[3] 3 , a: 0.678615 t: 1e-07 score 6.78615e-08 product :
> 5.65309e-17 ss 0
> [...]
> Executing: rm -f
> /home/dseita/KauchakWorking/train/giza.norm-simp/norm-simp.A3.final.gz
> Executing: gzip
> /home/dseita/KauchakWorking/train/giza.norm-simp/norm-simp.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Mon Jun 25 18:00:13 EDT 2012
> Combining forward and inverted alignment from files:
> /home/dseita/KauchakWorking/train/giza.norm-simp/norm-simp.A3.final.{bz2,gz}
> /home/dseita/KauchakWorking/train/giza.simp-norm/simp-norm.A3.final.{bz2,gz}
> Executing: mkdir -p /home/dseita/KauchakWorking/train/model
> Executing: /home/dseita/mosesdecoder/scripts/training/giza2bal.pl -d "gzip
> -cd /home/dseita/KauchakWorking/train/giza.simp-norm/simp-norm.A3.final.gz"
> -i "gzip -cd
> /home/dseita/KauchakWorking/train/giza.norm-simp/norm-simp.A3.final.gz"
> |/home/dseita/mosesdecoder/scripts/../bin/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no" >
> /home/dseita/KauchakWorking/train/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> Sentence mismatch error! Line #86
> Sentence mismatch error! Line #87
> Sentence mismatch error! Line #88
> Sentence mismatch error! Line #89
> Sentence mismatch error! Line #90
> Sentence mismatch error! Line #91
> Sentence mismatch error! Line #92
> Sentence mismatch error! Line #93
> Sentence mismatch error! Line #94
> [...Mismatch errors continue...]
> [...]
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
