Hi Kenneth,

I ran iconv on my raw file and on the iARPA/ARPA files; the encoding is OK and it did not print any errors. build_binary did not report any errors either. But I have finally found the issue causing the script to stop at line 95.
In addition to the suggested changes from http://www.mail-archive.com/[email protected]/msg01934.html, one needs to change line 13 from

my $TRAIN_SCRIPT = "train-factored-phrase-model.perl";

to

my $TRAIN_SCRIPT = "/my/path/to/train-model.perl";

To conclude, using build_binary or build-lm.sh worked out fine. However, if one would like to use compile-lm instead of build-lm, passing a gzipped iARPA file, the train-recaser script still stops at lines 64/70 due to UTF-8 issues. I'll ask the IRSTLM guys.

Thanks for your help! :)

Daniel

-----Original Message-----
From: Kenneth Heafield [mailto:[email protected]]
Sent: Monday, 14 November 2011 16:05
To: Daniel Schaut
Subject: Re: AW: [Moses-support] Train recasing model using IRSTLM

You can test if a file is UTF-8 using this command:

iconv -f utf8 -t utf8 <file_name >/dev/null

Does this succeed on your corpus, namely the file you're passing with --corpus? Or does it print an error?

What's the error message that build_binary gives you? None of the error messages you gave comes from build_binary.

On 11/14/11 14:40, Daniel Schaut wrote:
> Hi Kenneth,
>
> Thanks for your reply.
>
> I'm afraid I checked the iARPA file again, it's UTF8. Furthermore, I
> deleted the first line of the file and tried it again, but without
> success, same result:
> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
> <CORPUS> line 1.
> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
> <CORPUS> line 1.
>
> Further, I tried to call build_binary with an ARPA file, but still I
> get the same error as if I run build-lm.sh
> (4) Training recasing model @ Mon Nov 14 12:49:06 CET 2011
> Can't exec
> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
> No such file or directory at ./train-recaser.perl line 95.
>
> Of course, I cleaned my files beforehand with clean-corpus-n and also
> looked into train-recaser.
> Additionally, I changed the switch
> $TRAIN_SCRIPT from "train-factored-phrase-model.perl" to "train-model.perl" in line 13.
> Line 95 just echoes the error/command (print STDERR '$cmd';). In my
> folder "corpus", I've got files called "cased", "lowercased" and an LM
> called "cased.ilm/arpa" depending on the command I use.
> train-model.perl remains in /scripts-20111024-1127/training. Even if I
> move train-model.perl into /scripts-20111024-1127/recaser, the error at line 95 persists.
>
> What did I miss? Which line or switch do I have to change, too?
>
> Best,
> Daniel
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Kenneth Heafield
> Sent: Saturday, 12 November 2011 18:31
> To: [email protected]
> Subject: Re: [Moses-support] Train recasing model using IRSTLM
>
> Hi,
>
> It looks like your training data isn't valid UTF8. Either convert it
> to UTF8 with iconv or scrub the invalid data first.
>
> Kenneth
>
> On 11/12/11 15:58, Daniel Schaut wrote:
>> Dear all,
>>
>> I'm having some difficulties training the recasing model with IRSTLM.
>> I changed the train-recaser script according to
>>
>> http://www.mail-archive.com/[email protected]/msg01934.html
>>
>> but this results in an error which I don't know how to fix.
>>
>> Error log:
>>
>> -----------------------------------------------------------------------
>>
>> (4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
>>
>> /home/user/mosestools/scripts-20111024-1127/training/train-model.perl
>> --root-dir /home/user/moses/work/recaser --model-dir
>> /home/user/moses/work/recaser --first-step 4 --alignment a --corpus
>> /home/user/moses/work/recaser/aligned --f lowercased --e cased
>> --max-phrase-length 1 --lm
>> 0:3:/home/user/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir
>> /home/user/moses/mosestools/scripts-20111024-1127
>>
>> Can't exec
>> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
>> No such file or directory at ./train-recaser.perl line 95.
>>
>> (11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011
>>
>> -----------------------------------------------------------------------
>>
>> Then, instead of using build-lm.sh, I gave it another try, calling
>> compile-lm directly:
>>
>> my $cmd = "/home/user/moses/mosestools/irstlm-5.60.03/bin/compile-lm
>> $CORPUS /dev/stdout | gzip -c > $DIR/cased.irstlm.gz";
>>
>> where $CORPUS is a gzipped iARPA file.
>>
>> Error log:
>>
>> -----------------------------------------------------------------------
>>
>> (3) Preparing data for training recasing model @ Sat Nov 12 15:11:26
>> CET 2011
>>
>> /home/nexoc/moses/work/recaser/aligned.lowercased
>>
>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
>> <CORPUS> line 1.
>>
>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
>> <CORPUS> line 1.
>>
>> -----------------------------------------------------------------------
>>
>> Please see full error logs attached for more information.
>>
>> Could anyone give me a hint on how to train a recasing model with
>> either build-lm.sh or compile-lm? Help is very much appreciated.
>>
>> Thanks,
>>
>> Daniel
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
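
A note for readers who hit the same error: the byte \x8B reported at <CORPUS> line 1 is consistent with the second byte of the gzip header (1f 8b). That suggests, though the thread does not confirm it, that the gzipped file was being read as plain text. Below is a minimal Python sketch of the iconv check suggested above, extended to detect gzip input first. It is not part of Moses or IRSTLM, and the function names (looks_gzipped, first_invalid_utf8_offset) are hypothetical, chosen only for this illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def first_invalid_utf8_offset(path, decompress=True):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    With decompress=True, gzipped input is unpacked first, so the check
    applies to the text itself. With decompress=False, the raw bytes are
    checked as-is, which mimics reading a .gz file as if it were text.
    """
    opener = gzip.open if decompress and looks_gzipped(path) else open
    with opener(path, "rb") as handle:
        data = handle.read()  # whole file in memory; acceptable for a sketch
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

Calling first_invalid_utf8_offset(path) on the file passed as --corpus returns None when the (decompressed) content is valid UTF-8. Calling it with decompress=False on a gzipped file reports offset 1, the position of the 0x8B byte, which matches the message from train-recaser.perl.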
