Dear all,

I've built moses and everything seems to be working. However, things
go wrong during the training phase: phrase-table and the sorted.gz
files are exactly 20 bytes, while all other files in the training
directory look good. I've tried this with all files, even the sample
files, and the result is always the same: the translation is basically
the source text, unchanged, and phrase-table is empty.
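
A size of exactly 20 bytes is consistent with a gzip stream that holds
no data at all. A minimal sketch for confirming this, assuming standard
gzip and wc are available; the function names are made up for
illustration and are not part of moses:

```shell
# Print the size in bytes of a gzip stream created from empty input.
empty_gzip_size() {
    printf '' | gzip -c | wc -c | tr -d ' '
}

# Succeed (exit status 0) when the given .gz file decompresses to zero bytes.
is_empty_gzip() {
    [ "$(gzip -dc "$1" | wc -c | tr -d ' ')" -eq 0 ]
}
```

For example, `is_empty_gzip train/model/phrase-table.gz` (the path is an
assumption based on where moses.ini ends up) should succeed if the
phrase table really contains nothing.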

As this even happens with the sample files, it doesn't seem to be a
formatting issue. The training log starts generating lots and lots of
errors, at least one for each line, when LexicalTranslationModel.pm is
started:
uninitialized value $ei in array element
uninitialized value $ei in numeric ge (>=)
uninitialized value $ei in hash element
et cetera

The training directory is clean, so moses is not trying to use old
files (unless I'm overlooking something).

~/mosesdecoder/bin contains the compiled bin files. No issues there.

My source language is en, my target language is nl (Dutch). I'm using
"-f en -e nl" everywhere.

The files are cleaned with clean-corpus-n.perl.

This happens both with GIZA and MGIZA.

What am I missing?

TOKENIZATION MAIN
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train/client_main.en > ~/corpus/client_main.tok.en

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l nl < ~/corpus/train/client_main.nl > ~/corpus/client_main.tok.nl

TRUECASER TRAINING
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/client_main.tok.en

~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.nl --corpus ~/corpus/client_main.tok.nl

TRUECASING
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/client_main.tok.en > ~/corpus/client.true.en

~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.nl < ~/corpus/client_main.tok.nl > ~/corpus/client.true.nl

CLEANING
~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/client.true en nl ~/corpus/client.clean 1 80
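
One way to sanity-check the cleaned corpus before training is to confirm
that both sides are non-empty and have the same number of lines. This is
only a rough sketch: the function name is hypothetical, and the file
names in the usage line assume that clean-corpus-n.perl wrote
client.clean.en and client.clean.nl.

```shell
# Succeed when both files have the same, non-zero number of lines.
same_nonzero_line_count() {
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -gt 0 ] && [ "$a" -eq "$b" ]
}
```

Usage: `same_nonzero_line_count ~/corpus/client.clean.en ~/corpus/client.clean.nl && echo OK`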

LANGUAGE MODEL FROM lm (empty first, only for target, 3-gram)
~/mosesdecoder/bin/lmplz -o 3 <~/corpus/client.true.nl > client.arpa.nl

BINARIZING LANGUAGE MODEL for speed
~/mosesdecoder/bin/build_binary client.arpa.nl client.blm.nl

TRAINING FROM working (empty first)
~/mosesdecoder/scripts/training/train-model.perl -mgiza -root-dir train -corpus ~/corpus/client.clean -f en -e nl -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/client.arpa.nl:8 -external-bin-dir ~/mosesdecoder/tools -cores 4 >& training.out &

Result: moses.ini in ~/working/train/model
Phrase-table is 20 bytes
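
To get a feel for how many of the "uninitialized value" warnings
mentioned earlier end up in the log, something like the following can be
used. It is only a sketch: the function name is made up, and
training.out is the log file the training command writes to.

```shell
# Print the number of log lines that mention an uninitialized value.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function from failing in that case while still printing 0.
count_uninitialized_warnings() {
    grep -c 'uninitialized value' "$1" || true
}
```

Usage: `count_uninitialized_warnings training.out`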

TUNING TOKENIZATION
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train/client_tune.en > ~/corpus/client_tune.tok.en

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l nl < ~/corpus/train/client_tune.nl > ~/corpus/client_tune.tok.nl

TUNING TRUECASING
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/client_tune.tok.en > ~/corpus/client_val.true.en

~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.nl < ~/corpus/client_tune.tok.nl > ~/corpus/client_val.true.nl

TUNING from working
~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/client.true.en ~/corpus/client.true.nl ~/mosesdecoder/bin/moses train/model/moses.ini --no-filter-phrase-table --mertdir ~/mosesdecoder/bin/ &> mert.out &

Result: moses.ini in ~/working/mert-work/moses.ini

TESTING
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini

The translation never contains Dutch words, only source-language words
or unknowns. The same happens with the sample files (where fr was the
source and en the target).

TRANSLATE from tevertalen
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini < in > out

Best regards,

Loek van Kooten
