Re: [Moses-support] Errors training GIZA++

Tom Hoar Mon, 31 Jan 2011 20:42:24 -0800


Hi Nakul,

It looks like clean-corpus-n.perl has identified some bad
data (possibly broken UTF-8). 
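If you want to test the broken UTF-8 guess before re-running anything, a small script can list the suspect lines. This is only a sketch: the function name find_invalid_utf8_lines is made up for illustration, and it assumes the corpus is meant to be UTF-8 with one sentence per line.

```python
def find_invalid_utf8_lines(path):
    """Return the 1-based numbers of lines whose bytes are not valid UTF-8."""
    bad = []
    # Read raw bytes so that a decoding failure is raised per line, not per file.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(number)
    return bad
```

For example, find_invalid_utf8_lines("corpus/200HindiSens.txt") would return the line numbers to inspect by hand, or an empty list if every line decodes cleanly.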

The script train-factored-phrase-model.perl comes with older revisions of
Moses. The current distribution uses train-model.perl. However, both
train-xxx.perl scripts, and the instructions at
http://www.statmt.org/moses_steps.html [1], require you to manually copy
mkcls, GIZA++, & snt2cooc.out to the 'bin' folder that you created.
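A quick way to confirm the copy worked is to check the folder for the three binaries named in the error message. The sketch below is not part of Moses; missing_tools and REQUIRED_TOOLS are hypothetical names, and the folder path is whatever 'bin' folder you created.

```python
import os

# Tool names taken from the training script's error message.
REQUIRED_TOOLS = ("mkcls", "GIZA++", "snt2cooc.out")

def missing_tools(bin_dir, required=REQUIRED_TOOLS):
    """Return the required tool names that are not executable files in bin_dir."""
    missing = []
    for name in required:
        candidate = os.path.join(bin_dir, name)
        # A tool counts as present only if it is a regular file we may execute.
        if not (os.path.isfile(candidate) and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing
```

An empty list means all three tools are in place; anything else names what still needs to be copied.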

To get a better understanding of how all the tools work together, please
consider installing all the packages from DoMY CE (Do Moses Yourself).
DoMY CE is an open source packaged distribution of all Moses components
including GIZA++, MGIZA++, IRSTLM, and RandLM (not SRILM because it's
not distributed as open source).

DoMY CE automatically places all the files in the necessary locations,
including mkcls, GIZA++, & snt2cooc.out. By default, DoMY CE configures
the system to use MGIZA++. You can easily study the scripts to identify
where to revert to GIZA++ if necessary.

The current version of DoMY CE is a PPA distribution available only for
Ubuntu (I see you're using 10.04 LTS, which is our development
environment). You can register as a user at
http://www.precisiontranslationtools.com [2] to view the
download/installation instructions.

Best regards,
Tom 

On Tue, 1 Feb 2011 09:50:58 +0530, nakul sharma wrote:

Hi Barry,

./clean-corpus-n.perl in trunk/scripts/training returned the following
error:

./clean-corpus-n.perl corpus/* txt txt clean 1 50
clean-corpus.perl: processing corpus/200EnglishSens.txt.corpus/200HindiSens.txt & .txt to txt, cutoff clean-1
Use of uninitialized value $opn in open at ./clean-corpus-n.perl line 46.
Use of uninitialized value $opn in concatenation (.) or string at ./clean-corpus-n.perl line 46.
Can't open '' at ./clean-corpus-n.perl line 46.

Using train-factored-phrase-model.perl returned the following error:

Using SCRIPTS_ROOTDIR: /home/nakul/mosesdecoder/trunk/scripts
Using single-thread GIZA
ERROR: Cannot find mkcls, GIZA++, & snt2cooc.out in .
Did you install this script using 'make release'? at ./train-factored-phrase-model.perl line 205.

It seems that Moses does not recognize GIZA++ and mkcls; they are
installed in different directories. I want to train them separately. Is
it possible to do so? Regarding the vcb files, I got them by executing
the following command:

sudo ./plain2snt.out 200ESens.txt 200HSens.txt

This creates en.vcb, hn.vcb and the bitext files (200ESens_200HSens.snt,
200HSens_200ESens.snt) in GIZA++ format.

--
Thanks & Regards 
nakul.

On Mon, Jan 31, 2011 at 3:54 PM, Barry Haddow wrote:
 Hi Nakul

 Clean corpus will get rid of long lines and lines with a high length ratio,
 which giza doesn't like. This could fix your first error.

 Run ./clean-corpus-n.perl --help for usage instructions.
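To make the idea of length and ratio filtering concrete, here is a minimal sketch. It is not the logic of clean-corpus-n.perl itself; keep_pair and clean_pairs are hypothetical names, the 1 and 50 defaults mirror the arguments used in the command earlier in this thread, and the ratio limit of 9 is borrowed from the limit printed in the GIZA++ warning.

```python
def keep_pair(source, target, min_len=1, max_len=50, max_ratio=9.0):
    """Decide whether a sentence pair passes the length and ratio limits."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    # Both sides must fall inside the allowed token-count window.
    if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
        return False
    longer, shorter = max(src_len, tgt_len), min(src_len, tgt_len)
    if shorter == 0:
        # Only reachable when min_len is 0; an empty side cannot be aligned.
        return False
    return longer / shorter <= max_ratio

def clean_pairs(pairs, **limits):
    """Keep only the (source, target) pairs accepted by keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **limits)]
```

A pair like the one in the warning (source length 1, target length 11) has ratio 11 and would be dropped under these assumed limits.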

 As to the second error, if you're not using the moses scripts, how did you
 create the vcb files? It looks as though they don't match the corpus.
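One way to check for such a mismatch is to compare the word ids in the .snt file against the ids in the two .vcb files. The sketch below assumes the usual plain2snt.out layout as I understand it: each .vcb line is "id word count", and the .snt file holds three lines per sentence pair (occurrence count, source ids, target ids). The function names are made up for illustration.

```python
def read_vcb_ids(path):
    """Collect the word ids (first field of each line) from a .vcb file."""
    ids = set()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.add(int(fields[0]))
    return ids

def unknown_ids(snt_path, source_ids, target_ids):
    """Return (sentence_no, side, word_id) for ids missing from the vocabularies."""
    with open(snt_path, encoding="utf-8") as handle:
        lines = [line.split() for line in handle]
    problems = []
    # Assumed layout: count line, source id line, target id line, repeated.
    for index in range(0, len(lines) - 2, 3):
        sentence_no = index // 3 + 1
        sides = (("source", source_ids, lines[index + 1]),
                 ("target", target_ids, lines[index + 2]))
        for side, known, row in sides:
            for token in row:
                if int(token) not in known:
                    problems.append((sentence_no, side, int(token)))
    return problems
```

Any tuple it returns points at a sentence whose ids the vocabulary does not cover, which is the situation GIZA++ reports as "target word ... is not in the vocabulary list".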

 best regards - Barry

 On Monday 31 January 2011 10:17, nakul sharma wrote:
 > Hi Barry,
 >
 > I am not training GIZA++ through Moses; I am training it independently.
 > Will it make any difference? Also, I do not have clean-corpus-n.perl in
 > GIZA++. Please tell me what to do about it.
 >
 > On Mon, Jan 31, 2011 at 3:07 PM, Barry Haddow wrote:
 > > Hi Nakul
 > >
 > > Did you clean your corpus first (i.e. run clean-corpus-n.perl over it)?
 > >
 > > best regards - Barry
 > >
 > > On Monday 31 January 2011 04:20, nakul sharma wrote:
 > > > hi all,
 > > >
 > > > I have g++ version 4.4.3 and Ubuntu 10.04 LTS. While training
 > > > GIZA++, I get the following error upon execution of the GIZA++
 > > > executable:
 > > >
 > > > Reading vocabulary file from:200ESens.vcb
 > > > Reading vocabulary file from:200HSens.vcb
 > > > {WARNING:(a)truncated sentence 0}{WARNING:(a)truncated sentence 1}WARNING:
 > > > The following sentence pair has source/target sentence length ration
 > > > more than the maximum allowed limit for a source word fertility
 > > > source length = 1 target length = 11 ratio 11 ferility limit : 9
 > > > Shortening sentence
 > > > Sent No: 3 , No. Occurrences: 1
 > > > 0 254
 > > > 57 5 3 58 59 60 5 61 62 63 64
 > > >
 > > > I get this warning for almost every Sent No, and then for
 > > > sentence number 98 I get this error message:
 > > >
 > > > Sent No: 98 , No. Occurrences: 1
 > > > 0 457 458
 > > > 909 910 15 911 17 86 912 913 65 3 914 915 22 916 11 917 170 162 918 919
 > > > 3 684 22 8 920 921 22 8 333 922 923 924 22 925
 > > > ERROR: target word 937 is not in the vocabulary list.
 > > >
 > > > GIZA++ has generated only one file, **.root.gfcs.
 > > >
 > > > Please tell me how to deal with this problem.
 > >
 > > --
 > > The University of Edinburgh is a charitable body, registered in
 > > Scotland, with registration number SC005336.


-- 

Links:
------
[1] http://www.statmt.org/moses_steps.html
[2] http://www.precisiontranslationtools.com
[3] mailto:[email protected]
[4] mailto:[email protected]

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
