Dear all,

I've built Moses and everything seems to be working. However, things go wrong during the training phase: phrase-table and the sorted.gz files are exactly 20 bytes each, while all other files in the training directory look good. I've tried this with all files, even the sample files, and the result is always the same: the translation is basically the source text, unchanged, and phrase-table is empty.
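A 20-byte .gz file is typically what gzip produces for empty input, so the symptom can be confirmed from the shell. The following is only a minimal sketch: the helper names gz_is_empty and same_line_count are hypothetical (they are not part of Moses), and the paths in the usage note are assumptions based on the directory layout described in this message.

```shell
#!/bin/sh
# Sketch of two sanity checks around a Moses training run.
# Helper names are illustrative only; they are not Moses tools.

# Succeeds (exit status 0) when a .gz file decompresses to zero bytes.
gz_is_empty() {
    [ "$(gzip -dc "$1" | wc -c | tr -d ' ')" -eq 0 ]
}

# Succeeds when both sides of a parallel corpus have the same number of lines.
same_line_count() {
    [ "$(wc -l < "$1" | tr -d ' ')" -eq "$(wc -l < "$2" | tr -d ' ')" ]
}
```

For example, assuming the phrase table ended up in train/model: `gz_is_empty train/model/phrase-table.gz && echo "phrase table is empty"`, and for the cleaned corpus: `same_line_count ~/corpus/client.clean.en ~/corpus/client.clean.nl || echo "line counts differ"`.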
As this even happens with the sample files, it doesn't seem to be a formatting issue. The training log starts generating lots and lots of errors, at least one for each line, when LexicalTranslationModel.pm is started:

uninitialized value $ei in array element
uninitialized value $ei in numeric ge (>=)
uninitialized value $ei in hash element
et cetera

The training directory is clean, so moses is not trying to use old files (unless I'm overlooking something).
~/mosesdecoder/bin contains the compiled bin files. No issues there.
My source language is en, my target language is nl (Dutch). I'm using "-f en -e nl" everywhere.
The files are cleaned with clean-corpus-n.perl
This happens both with GIZA and MGIZA.

What am I missing?

TOKENIZATION MAIN
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train/client_main.en > ~/corpus/client_main.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l nl < ~/corpus/train/client_main.nl > ~/corpus/client_main.tok.nl

TRUECASER TRAINING
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/client_main.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.nl --corpus ~/corpus/client_main.tok.nl

TRUECASING
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/client_main.tok.en > ~/corpus/client.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.nl < ~/corpus/client_main.tok.nl > ~/corpus/client.true.nl

CLEANING
~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/client.true en nl ~/corpus/client.clean 1 80

LANGUAGE MODEL FROM lm (empty first, only for target, 3-gram)
~/mosesdecoder/bin/lmplz -o 3 <~/corpus/client.true.nl > client.arpa.nl

BINARIZING LANGUAGE MODEL for speed
~/mosesdecoder/bin/build_binary client.arpa.nl client.blm.nl

TRAINING FROM working (empty first)
~/mosesdecoder/scripts/training/train-model.perl -mgiza -root-dir train \
  -corpus ~/corpus/client.clean -f en -e nl \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/lm/client.arpa.nl:8 \
  -external-bin-dir ~/mosesdecoder/tools -cores 4 >& training.out &

Result: moses.ini in ~/working/train/model
Phrase-table is 20 bytes

TUNING TOKENIZATION
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train/client_tune.en > ~/corpus/client_tune.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l nl < ~/corpus/train/client_tune.nl > ~/corpus/client_tune.tok.nl

TUNING TRUECASING
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/client_tune.tok.en > ~/corpus/client_val.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.nl < ~/corpus/client_tune.tok.nl > ~/corpus/client_val.true.nl

TUNING from working
~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/client.true.en ~/corpus/client.true.nl ~/mosesdecoder/bin/moses train/model/moses.ini --no-filter-phrase-table --mertdir ~/mosesdecoder/bin/ &> mert.out &

Result: moses.ini in ~/working/mert-work/moses.ini

TESTING
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini

Translation never contains Dutch words. Only words in source language or unknowns, even with sample files (where fr was source and en was target).

TRANSLATE from tevertalen
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini < in > out

Best regards,
Loek van Kooten

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support