Dear all,
I'm having some difficulty training the recasing model with IRSTLM. I changed the train-recaser script according to http://www.mail-archive.com/[email protected]/msg01934.html, but this results in an error which I don't know how to fix.

Error log:
-----------------------------------------------------------------------
(4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
/home/user/mosestools/scripts-20111024-1127/training/train-model.perl --root-dir /home/user/moses/work/recaser --model-dir /home/user/moses/work/recaser --first-step 4 --alignment a --corpus /home/user/moses/work/recaser/aligned --f lowercased --e cased --max-phrase-length 1 --lm 0:3:/home/user/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir /home/user/moses/mosestools/scripts-20111024-1127
Can't exec "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl": No such file or directory at ./train-recaser.perl line 95.
(11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011
-----------------------------------------------------------------------

Then, instead of using build-lm.sh, I gave it another try by calling compile-lm directly:

my $cmd = "/home/user/moses/mosestools/irstlm-5.60.03/bin/compile-lm $CORPUS /dev/stdout | gzip -c > $DIR/cased.irstlm.gz";

where $CORPUS is a gzipped iARPA file.

Error log:
-----------------------------------------------------------------------
(3) Preparing data for training recasing model @ Sat Nov 12 15:11:26 CET 2011
/home/nexoc/moses/work/recaser/aligned.lowercased
utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64, <CORPUS> line 1.
Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70, <CORPUS> line 1.
-----------------------------------------------------------------------

Please see the full error logs attached for more information. Could anyone give me a hint on how to train a recasing model with either build-lm.sh or compile-lm?

Help is very much appreciated.

Thanks,
Daniel
./train-recaser-irstlm.perl -train-script /home/nexoc/mosestools/scripts-20111024-1127/training/train-model.perl -corpus /home/nexoc/moses/work/corpus/cased.ilm.gz -dir /home/nexoc/moses/work/recaser -scripts-root-dir /home/nexoc/moses/mosestools/scripts-20111024-1127

(2) Train language model on cased data @ Sat Nov 12 15:11:22 CET 2011
/home/nexoc/moses/mosestools/irstlm-5.60.03/bin/compile-lm /home/nexoc/moses/work/corpus/cased.ilm.gz /dev/stdout | gzip -c > /home/nexoc/moses/work/recaser/cased.irstlm.gz
inpfile: /home/nexoc/moses/work/corpus/cased.ilm.gz
dub: 10000000
Language Model Type of /home/nexoc/moses/work/corpus/cased.ilm.gz is 1
Reading /home/nexoc/moses/work/corpus/cased.ilm.gz...
iARPA
loadtxt()
1-grams: reading 22785 entries
2-grams: reading 120301 entries
3-grams: reading 220243 entries
done
OOV code is 22784
OOV code is 22784
creating cache for storing prob, state and statesize of ngrams
Saving in bin format to /dev/stdout
savebin: /dev/stdout
saving 22785 1-grams
saving 120301 2-grams
saving 220243 3-grams
done
deleting cache for storing prob, state and statesize of ngrams
(3) Preparing data for training recasing model @ Sat Nov 12 15:11:26 CET 2011
/home/nexoc/moses/work/recaser/aligned.lowercased
utf8 "\x8B" does not map to Unicode at ./train-recaser-irstlm.perl line 64, <CORPUS> line 1.
Malformed UTF-8 character (fatal) at ./train-recaser-irstlm.perl line 70, <CORPUS> line 1.

Result: creates four broken files (aligned.a, aligned.lowercased, aligned.cased and alinged.irstlm.gz) in the directory /home/user/moses/work/recaser, and a cased.ilm.lm file in the ROOT_SCRIPTS directory recaser.
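
A note on the "\x8B" message in the log above: 0x1F 0x8B are the first two bytes of every gzip stream, so the error may mean that the script is decoding the gzip-compressed file passed as -corpus as UTF-8 text (it fails at <CORPUS> line 1). The Python sketch below is only a way to test that guess. It is not part of Moses or IRSTLM, and the name looks_gzipped is made up for illustration.

```python
# Assumption: the check is run by hand on the file given to -corpus.
# The helper name is hypothetical; only the magic number itself is standard.
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If looks_gzipped("/home/nexoc/moses/work/corpus/cased.ilm.gz") returns True, the file is compressed and cannot be read line by line as plain UTF-8 text without decompressing it first.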
./train-recaser-raw.perl -train-script /home/nexoc/mosestools/scripts-20111024-1127/training/train-model.perl -corpus /home/nexoc/moses/work/corpus/cased -dir /home/nexoc/moses/work/recaser -scripts-root-dir /home/nexoc/moses/mosestools/scripts-20111024-1127

(2) Train language model on cased data @ Sat Nov 12 14:46:36 CET 2011
/home/nexoc/moses/mosestools/irstlm-5.60.03/bin/build-lm.sh -t /tmp -i /home/nexoc/moses/work/corpus/cased -n 3 -o /home/nexoc/moses/work/recaser/cased.irstlm.gz
Collecting 1-gram counts
Computing n-gram probabilities:
Collecting 1-gram counts
Computing n-gram probabilities:
Collecting 1-gram counts
Computing n-gram probabilities:
Cleaning temporary directory /tmp
Extracting dictionary from training corpus
Splitting dictionary into 3 lists
Extracting n-gram statistics for each word list
Important: dictionary must be ordered according to order of appearance of words in data used to generate n-gram blocks, so that sub language model blocks results ordered too
dict.000
dict.001
dict.002
Estimating language models for each word list
dict.000
dict.001
dict.002
Merging language models into /home/nexoc/moses/work/recaser/cased.irstlm.gz
Cleaning temporary directory /tmp
Removing temporary directory /tmp
(3) Preparing data for training recasing model @ Sat Nov 12 14:49:05 CET 2011
/home/nexoc/moses/work/recaser/aligned.lowercased
(4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
/home/nexoc/mosestools/scripts-20111024-1127/training/train-model.perl --root-dir /home/nexoc/moses/work/recaser --model-dir /home/nexoc/moses/work/recaser --first-step 4 --alignment a --corpus /home/nexoc/moses/work/recaser/aligned --f lowercased --e cased --max-phrase-length 1 --lm 0:3:/home/nexoc/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir /home/nexoc/moses/mosestools/scripts-20111024-1127
Can't exec "/home/nexoc/mosestools/scripts-20111024-1127/training/train-model.perl": No such file or directory at ./train-recaser-raw.perl line 95.
(11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011

Result: LM in iARPA format.
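
In the log above, the -train-script path begins with /home/nexoc/mosestools/ while -scripts-root-dir begins with /home/nexoc/moses/mosestools/, so the "No such file or directory" message may simply come from a path that does not exist on disk. As a minimal sketch, assuming the two paths are checked by hand before starting the run, a small Python helper could look like this (script_problem is a hypothetical name, not part of the Moses scripts):

```python
import os


def script_problem(path):
    """Describe why `path` cannot be executed, or return None if it looks fine."""
    if not os.path.isfile(path):
        return "missing"          # matches "No such file or directory"
    if not os.access(path, os.X_OK):
        return "not executable"   # file exists but lacks the execute bit
    return None
```

Calling script_problem on the value given to -train-script reports "missing" when the file is not there, which would point to the path rather than to IRSTLM.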
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
