Hi Nakul,
It looks like clean-corpus-n.perl has identified some bad data (possibly broken UTF-8).

The script train-factored-phrase-model.perl comes with older revisions of Moses. The current distribution uses train-model.perl. However, both train-xxx.perl scripts, as well as the instructions at http://www.statmt.org/moses_steps.html [1], require you to manually copy mkcls, GIZA++, & snt2cooc.out to the 'bin' folder that you created.

To get a better understanding of how all the tools work together, please consider installing all the packages from DoMY CE (Do Moses Yourself). DoMY CE is an open source packaged distribution of all Moses components including GIZA++, MGIZA++, IRSTLM, and RandLM (not SRILM because it's not distributed as open source). DoMY CE automatically places all the files in the necessary locations, including mkcls, GIZA++, & snt2cooc.out. By default, DoMY CE configures the system to use MGIZA++. You can easily study the scripts to identify where to revert to GIZA++ if necessary.

The current version of DoMY CE is a PPA distribution available only for Ubuntu (I see you're using 10.04 LTS, which is our development environment). You can register as a user at http://www.precisiontranslationtools.com [2] to view the download/installation instructions.

Best regards,
Tom

On Tue, 1 Feb 2011 09:50:58 +0530, nakul sharma wrote:

Hi Barry,

./clean-corpus-n.perl in trunk/scripts/training returned the following error:

./clean-corpus-n.perl corpus/* txt txt clean 1 50
clean-corpus.perl: processing corpus/200EnglishSens.txt.corpus/200HindiSens.txt & .txt to txt, cutoff clean-1
Use of uninitialized value $opn in open at ./clean-corpus-n.perl line 46.
Use of uninitialized value $opn in concatenation (.) or string at ./clean-corpus-n.perl line 46.
Can't open '' at ./clean-corpus-n.perl line 46.
Using train-factored-phrase-model.perl returned the following error:

Using SCRIPTS_ROOTDIR: /home/nakul/mosesdecoder/trunk/scripts
Using single-thread GIZA
ERROR: Cannot find mkcls, GIZA++, & snt2cooc.out in .
Did you install this script using 'make release'? at ./train-factored-phrase-model.perl line 205.

It seems that Moses does not recognize GIZA++ and mkcls. They are installed in different directories. I want to train them separately. Is it possible to do so?

Regarding the vcb file, I got it by executing the following command:

sudo ./plain2snt.out 200ESens.txt 200HSens.txt

It creates en.vcb, hn.vcb and bitext files (200ESens_200HSens.snt, 200HSens_200ESens.snt) in GIZA++ format.

--
Thanks & Regards
nakul.

On Mon, Jan 31, 2011 at 3:54 PM, Barry Haddow wrote:

Hi Nakul

Clean corpus will get rid of long lines and lines with a high length ratio, which giza doesn't like. This could fix your first error. Run ./clean-corpus-n.perl --help for usage instructions.

As to the second error, if you're not using the moses scripts, how did you create the vcb files? It looks as though they don't match the corpus.

best regards - Barry

On Monday 31 January 2011 10:17, nakul sharma wrote:
> Hi Barry,
>
> I am not training giza through moses. I am training it independently. Will
> it make any difference? Anyway, I do not have clean-corpus-n.perl in
> giza. Please tell me what to do about it.
>
> On Mon, Jan 31, 2011 at 3:07 PM, Barry Haddow wrote:
> > Hi Nakul
> >
> > Did you clean your corpus first (ie run clean-corpus-n.perl over it) ?
> >
> > best regards - Barry
> >
> > On Monday 31 January 2011 04:20, nakul sharma wrote:
> > > hi all,
> > >
> > > I have g++ version 4.4.3 and Ubuntu 10.04 LTS. While training
> > > GIZA++, I get the following error upon execution of the GIZA++ exe file:
> > >
> > > Reading vocabulary file from:200ESens.vcb
> > > Reading vocabulary file from:200HSens.vcb
> > > {WARNING:(a)truncated sentence 0}{WARNING:(a)truncated sentence 1}WARNING:
> > > The following sentence pair has source/target sentence length ration
> > > more than the maximum allowed limit for a source word fertility
> > > source length = 1 target length = 11 ratio 11 ferility limit : 9
> > > Shortening sentence
> > > Sent No: 3 , No. Occurrences: 1
> > > 0 254
> > > 57 5 3 58 59 60 5 61 62 63 64
> > >
> > > I get this warning for almost all the Sent No, and then for
> > > sentence number 98 I get this error message:
> > >
> > > Sent No: 98 , No. Occurrences: 1
> > > 0 457 458
> > > 909 910 15 911 17 86 912 913 65 3 914 915 22 916 11 917 170 162 918 919
> > > 3 684 22 8 920 921 22 8 333 922 923 924 22 925
> > > ERROR: target word 937 is not in the vocabulary list.
> > >
> > > Giza++ has generated only one file **.root.gfcs.
> > >
> > > Please tell me how to deal with this problem.
> >
> > --
> > The University of Edinburgh is a charitable body, registered in
> > Scotland, with registration number SC005336.

--
Links:
------
[1] http://www.statmt.org/moses_steps.html
[2] http://www.precisiontranslationtools.com
[3] mailto:[email protected]
[4] mailto:[email protected]
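
For anyone following the thread, the kind of filtering Barry describes (dropping over-long lines and sentence pairs with a high length ratio) can be sketched roughly as below. This is only an illustration in Python, not clean-corpus-n.perl itself. The function name clean_corpus and its parameters are hypothetical; the limits 1 and 50 are taken from the command in the thread, and the ratio of 9 is borrowed from the fertility limit shown in the GIZA++ warning. The real script's exact rules may differ.

```python
def clean_corpus(src_lines, tgt_lines, min_len=1, max_len=50, max_ratio=9.0):
    """Return the (source, target) pairs that pass simple length checks.

    Hypothetical helper: it mimics the idea of corpus cleaning
    (length limits plus a length-ratio limit), not the Moses script.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        s = len(src.split())  # token count, whitespace-separated
        t = len(tgt.split())
        # Empty sentences are always dropped (also avoids dividing by zero).
        if s == 0 or t == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_len <= s <= max_len and min_len <= t <= max_len):
            continue
        # Drop pairs whose length ratio exceeds the limit in either direction.
        if s / t > max_ratio or t / s > max_ratio:
            continue
        kept.append((src.strip(), tgt.strip()))
    return kept


if __name__ == "__main__":
    source = ["a short sentence", "one"]
    target = ["its short translation", "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11"]
    # The second pair has lengths 1 and 11 (ratio 11), like the pair
    # in the GIZA++ warning, so it is filtered out.
    for pair in clean_corpus(source, target):
        print(pair)
```

Filtering both sides together, pair by pair, keeps the two files aligned line for line, which is what GIZA++ needs.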
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
