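Aha. This is mildly amusing. gzip's magic bytes are 0x1f 0x8b, and "\x8B" is exactly the byte the error reports on line 1 of <CORPUS>. That Perl script is not prepared to accept gzipped files, so it is reading the compressed data as if it were UTF-8 text.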
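To make the diagnosis concrete, here is a minimal sketch in Python (not the Perl of train-recaser.perl, and not a patch for it) of checking the first two bytes of a file before decoding it as text. The names is_gzipped and open_text are hypothetical helpers made up for this illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path, encoding="utf-8"):
    """Open `path` as text, decompressing on the fly when it is gzipped.

    Hypothetical helper: a reader that skips this check and decodes a
    gzipped file as UTF-8 will typically fail on the 0x8b byte, much like
    the error quoted in this thread.
    """
    if is_gzipped(path):
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

With a check like this, the same reading loop can handle both plain and gzipped corpora. The other option is simply to decompress the file before passing it to the script.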
Kenneth

On 11/15/11 10:24, Daniel Schaut wrote:
> Hi Kenneth,
>
> I ran iconv on my raw file and on the iARPA/ARPA files; the encoding is
> fine, and it did not print any errors. build_binary did not report any
> errors either.
> But I finally found the issue causing the script to stop at line 95.
>
> In addition to the suggested changes from
> http://www.mail-archive.com/[email protected]/msg01934.html,
>
> one needs to change line 13 from
> my $TRAIN_SCRIPT = " train-factored-phrase-model.perl";
> to
> my $TRAIN_SCRIPT = "/my/path/to/train-model.perl";
>
> To conclude, using build_binary or build-lm.sh worked out fine.
> However, if one would like to use compile-lm instead of build-lm, passing a
> gzipped iARPA file, the train-recaser script still stops at line 64/70 due
> to UTF8 issues. I'll ask the IRSTLM guys.
>
> Thanks for your help! :)
> Daniel
>
> -----Original Message-----
> From: Kenneth Heafield [mailto:[email protected]]
> Sent: Monday, 14 November 2011 16:05
> To: Daniel Schaut
> Subject: Re: AW: [Moses-support] Train recasing model using IRSTLM
>
> You can test if a file is UTF-8 using this command:
>
> iconv -f utf8 -t utf8 <file_name >/dev/null
>
> Does this succeed on your corpus, namely the file you're passing with
> --corpus? Or does it print an error?
>
> What's the error message that build_binary gives you? None of the error
> messages you gave comes from build_binary.
>
> On 11/14/11 14:40, Daniel Schaut wrote:
>> Hi Kenneth,
>>
>> Thanks for your reply.
>>
>> I'm afraid I checked the iARPA file again, it's UTF8. Furthermore, I
>> deleted the first line of the file and tried it again, but without
>> success, same result:
>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
>> <CORPUS> line 1.
>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
>> <CORPUS> line 1.
>>
>> Further, I tried to call build_binary with an ARPA file, but still I
>> get the same error as if I run build-lm.sh
>> (4) Training recasing model @ Mon Nov 14 12:49:06 CET 2011 Can't exec
>> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl
>> ": No such file or directory at ./train-recaser.perl line 95.
>>
>> Of course, I cleaned my files beforehand with clean-corpus-n and also
>> looked into train-recaser. Additionally, I changed the switch
>> $TRAIN_SCRIPT from "train-factored-phrase-model.perl" to
>> "train-model.perl" in line 13.
>> Line 95 just echoes the error/command (print STDERR '$cmd';). In my
>> folder "corpus", I've got files called "cased", "lowercased" and a LM
>> called "cased.ilm/arpa" depending on the command I use.
>> Train-model.perl remains in /scripts-20111024-1127/training. Even if I
>> move train-model.perl into /scripts-20111024-1127/recaser, the error at
>> line 95 persists.
>> What did I miss? Which line or switch do I have to change, too?
>>
>> Best,
>> Daniel
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On behalf of Kenneth Heafield
>> Sent: Saturday, 12 November 2011 18:31
>> To: [email protected]
>> Subject: Re: [Moses-support] Train recasing model using IRSTLM
>>
>> Hi,
>>
>> It looks like your training data isn't valid UTF8. Either convert it
>> to UTF8 with iconv or scrub the invalid data first.
>>
>> Kenneth
>>
>> On 11/12/11 15:58, Daniel Schaut wrote:
>>> Dear all,
>>>
>>> I’m having some difficulties training the recasing model with IRSTLM.
>>> I changed the train-recaser script according to
>>>
>>> http://www.mail-archive.com/[email protected]/msg01934.html
>>>
>>> but this results in an error which I don’t know how to fix.
>>>
>>> Error log:
>>>
>>> -----------------------------------------------------------------------
>>>
>>> (4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
>>>
>>> /home/user/mosestools/scripts-20111024-1127/training/train-model.perl
>>> --root-dir /home/user/moses/work/recaser --model-dir
>>> /home/user/moses/work/recaser --first-step 4 --alignment a --corpus
>>> /home/user/moses/work/recaser/aligned --f lowercased --e cased
>>> --max-phrase-length 1 --lm
>>> 0:3:/home/user/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir
>>> /home/user/moses/mosestools/scripts-20111024-1127
>>>
>>> Can't exec
>>> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
>>> No such file or directory at ./train-recaser.perl line 95.
>>>
>>> (11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011
>>>
>>> -----------------------------------------------------------------------
>>>
>>> Then instead of using build-lm.sh, I gave it another try calling
>>> compile-lm directly:
>>>
>>> my $cmd = "/home/user/moses/mosestools/irstlm-5.60.03/bin/compile-lm
>>> $CORPUS /dev/stdout | gzip -c> $DIR/cased.irstlm.gz
>>>
>>> where $CORPUS is a gzip iARPA file.
>>>
>>> Error log:
>>>
>>> -----------------------------------------------------------------------
>>>
>>> (3) Preparing data for training recasing model @ Sat Nov 12 15:11:26 CET 2011
>>>
>>> /home/nexoc/moses/work/recaser/aligned.lowercased
>>>
>>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
>>> <CORPUS> line 1.
>>>
>>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
>>> <CORPUS> line 1.
>>>
>>> -----------------------------------------------------------------------
>>>
>>> Please see full error logs attached for more information.
>>>
>>> Could anyone give me a hint on how to train a recasing model with
>>> either build-lm.sh or compile-lm? Help is very much appreciated.
>>>
>>> Thanks,
>>>
>>> Daniel
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
