Hi,

GIZA++ has a limit of 100 words per sentence. In practice it rarely makes sense to include sentences longer than 60 words in training, since word alignment is difficult to compute for such long sentences.
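To make the length limits concrete, here is a minimal Python sketch of the kind of filtering described in this thread: drop empty lines, overly long sentences, and pairs with an excessive length mismatch. It is not the clean-corpus-n.perl implementation. The function names (keep_pair, filter_corpus) and the default max_ratio value are assumptions made for illustration; the 1 to 60 range is the one mentioned in the thread.

```python
def keep_pair(src, tgt, min_len=1, max_len=60, max_ratio=9.0):
    """Return True if a sentence pair looks safe to pass to word alignment.

    max_ratio is an illustrative assumption, not a value taken from this thread.
    """
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    # Reject empty lines outright (this also avoids dividing by zero below).
    if min(n_src, n_tgt) == 0:
        return False
    # Reject sentences outside the allowed length range (in whitespace tokens).
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Reject pairs whose lengths are badly mismatched.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(src_lines, tgt_lines, **limits):
    """Yield only the (source, target) pairs that pass keep_pair."""
    for src, tgt in zip(src_lines, tgt_lines):
        if keep_pair(src, tgt, **limits):
            yield src, tgt
```

Both sides must be filtered together, as above, so that the two files stay line-aligned after cleaning.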
-phi

On Wed, Oct 7, 2009 at 6:37 PM, Danish Contractor <[email protected]> wrote:
> Hi,
>
> Thanks for the reply. Yes, I did run the clean-corpus-n.perl script.
> I also had to replace all occurrences of "|" in the hindi text with another
> character as it seems "|" is of special significance to the scripts.
>
> The "|" is used in the hindi language as a full stop ("." --- end of
> sentence marker).
>
> Could you please let me know if there is a limit on the max length of
> sentences - I gave a length of 1 - 60 while running the script.
> In addition, is there any limit on the max allowable difference in sentence
> length of the parallel text?
>
> Thanks.
> --Danish
>
> On Wed, Oct 7, 2009 at 6:41 PM, Philipp Koehn <[email protected]> wrote:
>>
>> Hi,
>>
>> the problem lies in the word alignment step (step 3) - you can run the
>> step in isolation to check in more detail about what is going wrong.
>>
>> One common problem with word alignment is that GIZA++ is sensititive
>> to bad data, i.e. empty lines, long sentences, or excessive mismatch
>> in sentence length. The clean-corpus-n.perl script is designed to take
>> care of these problems. Did you run this on your original corpus?
>>
>> -phi
>>
>> On Sun, Oct 4, 2009 at 6:32 AM, Danish Contractor
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > I have compiled Moses,Giza & SRILM on Fedora Core 11 using the steps
>> > described in http://www.statmt.org/moses_steps.html and other moses
>> > support links.
>> >
>> > While training my parallel corpus of english and hindi (~100,000
>> > sentences each) I get an error as shown below when i execute:
>> >
>> > nohup nice
>> > ./tools/moses-scripts/scripts-20091002-0031//training/train-factored-phrase-model.perl
>> > -scripts-root-dir ./tools/moses-scripts/scripts-20091002-0031/ -root-dir
>> > work3 -corpus ./work3/corpus/IRL-clean -f hi2 -e en2 -alignment
>> > grow-diag-final-and -reordering msd-bidirectional-fe -lm
>> > 0:3:/home/danish/FIRE2010/work3/lm/IRL-en.lm >& ./work3/training.out &
>> >
>> > In one step of the training process, I get the following error and the
>> > tools quits:
>> >
>> > Last few lines of output (training.out) :
>> >
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> >
>> > Saved: ./work3//model/lex.f2e and ./work3//model/lex.e2f
>> > FILE: ./work3/corpus/IRL-clean.en2
>> > FILE: ./work3/corpus/IRL-clean.hi2
>> > FILE: ./work3//model/aligned.grow-diag-final-and
>> > (5) extract phrases @ Sat Oct 3 02:46:00 IST 2009
>> > ./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract ./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2 ./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7 --NoFileLimit orientation
>> > Executing: ./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract ./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2 ./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7 --NoFileLimit orientation
>> > PhraseExtract v1.4, written by Philipp Koehn
>> > phrase extraction from an aligned parallel corpus
>> > .........Executing: gzip ./work3//model/extract.inv
>> > gzip: ./work3//model/extract.inv: No such file or directory
>> > Exit code: 1
>> > ERROR at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 963.
>> >
>> >
>> > My clean sentence files are with the extension hi2 (for hindi) and en2
>> > (for english).
>> > I have tried solutions available on moses support forums for similar
>> > problems, but they have not helped.
>> >
>> > The following is a listing of the files & folders in my work folder
>> > (work3)
>> >
>> > corpus folder
>> > total 76384
>> > -rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
>> > -rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
>> > -rw-r--r--. 1 root root 1781671 2009-10-03 17:44 hi2.vcb.classes
>> > -rw-r--r--. 1 root root 1579583 2009-10-03 17:44 hi2.vcb.classes.cats
>> > -rw-r--r--. 1 root root 704087 2009-10-03 17:50 en2.vcb.classes
>> > -rw-r--r--. 1 root root 534277 2009-10-03 17:50 en2.vcb.classes.cats
>> > -rw-r--r--. 1 root root 2158362 2009-10-03 17:50 hi2.vcb
>> > -rw-r--r--. 1 root root 1013926 2009-10-03 17:50 en2.vcb
>> > -rw-r--r--. 1 root root 15605740 2009-10-03 17:50 hi2-en2-int-train.snt
>> > -rw-r--r--. 1 root root 15605740 2009-10-03 17:51 en2-hi2-int-train.snt
>> >
>> > giza.en2-hi2 folder
>> > total 124088
>> > -rw-r--r--. 1 root root 109989326 2009-10-03 18:44 en2-hi2.cooc
>> > -rw-r--r--. 1 root root 1651 2009-10-03 18:44 en2-hi2.gizacfg
>> > -rw-r--r--. 1 root root 17070807 2009-10-03 19:22 en2-hi2.A3.final.gz
>> >
>> > giza.hi2-en2 folder
>> > total 124052
>> > -rw-r--r--. 1 root root 110088686 2009-10-03 17:51 hi2-en2.cooc
>> > -rw-r--r--. 1 root root 1651 2009-10-03 17:51 hi2-en2.gizacfg
>> > -rw-r--r--. 1 root root 16928263 2009-10-03 18:43 hi2-en2.A3.final.gz
>> >
>> > lm folder
>> > total 100388
>> > -rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
>> > -rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
>> > -rw-r--r--. 1 root root 22834140 2009-10-03 17:29 IRL-en.lm
>> > -rw-r--r--. 1 root root 40731568 2009-10-03 17:30 IRL-hi.lm
>> >
>> > model folder
>> > total 7992
>> > -rw-r--r--. 1 root root 0 2009-10-03 19:23 aligned.grow-diag-final-and
>> > -rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.f2e
>> > -rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.e2f
>> >
>> > I can see the model folder does not contain the extract.inv file which
>> > seems to cause the error. I have re-done the steps thrice and face the
>> > exact same error each time.
>> >
>> > I have ensured that the parallel text has been lower cased (for english)
>> > and cleaned (english & hindi both).
>> > May I request you to kindly help me resolve this issue at the earliest.
>> > Thanks!
>> >
>> > Thank you,
>> > Regards,
>> >
>> > Danish Contractor
>> >
>> >
>> > _______________________________________________
>> > Moses-support mailing list
>> > [email protected]
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>> >
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
