Hi Kenneth,

I ran iconv on my raw file and on the iARPA/ARPA files. The encoding is fine:
iconv did not print any errors, and build_binary did not report any either.
I have now found the issue that caused the script to stop at line 95, though.

In addition to the suggested changes from
http://www.mail-archive.com/[email protected]/msg01934.html,

you need to change line 13 from
my $TRAIN_SCRIPT = " train-factored-phrase-model.perl";
to
my $TRAIN_SCRIPT = "/my/path/to/train-model.perl";
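
Since "Can't exec ... No such file or directory" only says that the path could
not be run, it can help to check the value of $TRAIN_SCRIPT directly before
rerunning the recaser. Below is a minimal Python sketch of such a check. The
function name check_script is hypothetical, and /my/path/to/train-model.perl is
the placeholder from above, not a real location.

```python
import os


def check_script(path):
    """Return a short diagnosis for a script path that will be exec'ed."""
    if path != path.strip():
        # A stray space or newline inside the quoted string becomes part of the path.
        return "path has leading or trailing whitespace"
    if not os.path.isfile(path):
        return "no such file"
    if not os.access(path, os.X_OK):
        return "file exists but is not executable"
    return "ok"


# Placeholder path from the message above; replace it with the real location.
print(check_script("/my/path/to/train-model.perl"))
```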

To conclude, both build_binary and build-lm.sh work fine now.
However, if you want to use compile-lm instead of build-lm.sh and pass it a
gzipped iARPA file, the train-recaser script still stops at lines 64/70 with
UTF-8 errors. I'll ask the IRSTLM developers about that.
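
One thing that may be worth checking here, as a guess rather than a confirmed
diagnosis: gzip files always start with the bytes 0x1F 0x8B, so a complaint
about "\x8B" on line 1 would be consistent with a gzipped file being read as
plain UTF-8 text. Below is a minimal Python sketch that tests for those magic
bytes and decompresses only when needed. The function names and the sample
file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def read_text(path):
    """Read a file as UTF-8 text, decompressing it first if it is gzipped."""
    opener = gzip.open if looks_gzipped(path) else open
    with opener(path, "rb") as handle:
        return handle.read().decode("utf-8")


# Demo on two throwaway files (hypothetical sample data).
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "cased.txt")
gz_path = os.path.join(workdir, "cased.txt.gz")
sample = "Ein Beispielsatz mit Umlauten: äöü\n"
with open(plain_path, "wb") as handle:
    handle.write(sample.encode("utf-8"))
with gzip.open(gz_path, "wb") as handle:
    handle.write(sample.encode("utf-8"))

print(looks_gzipped(plain_path), looks_gzipped(gz_path))
```

If the corpus file turns out to be gzipped, decompressing it before handing it
to train-recaser would be the first thing to try.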

Thanks for your help! :)
Daniel

-----Original Message-----
From: Kenneth Heafield [mailto:[email protected]]
Sent: Monday, 14 November 2011 16:05
To: Daniel Schaut
Subject: Re: AW: [Moses-support] Train recasing model using IRSTLM

You can test if a file is UTF-8 using this command:

iconv -f utf8 -t utf8 <file_name >/dev/null

Does this succeed on your corpus, namely the file you're passing with
--corpus? Or does it print an error?
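
The same check can be sketched in Python when it is useful to know where the
first bad byte sits. This is an illustration, not part of the Moses scripts.
The function name first_invalid_utf8 is hypothetical, and the data is expected
as bytes (for example, a file read in binary mode), so a very large corpus
would need to be read in chunks instead.

```python
def first_invalid_utf8(data):
    """Return (byte_offset, line_number) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the newlines before the bad byte to get a 1-based line number.
        line_number = data.count(b"\n", 0, err.start) + 1
        return err.start, line_number
    return None


# Valid UTF-8 passes; a lone 0x8B byte is reported with its position.
print(first_invalid_utf8("gültig\n".encode("utf-8")))
print(first_invalid_utf8(b"ok\nbad \x8b byte\n"))
```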

What's the error message that build_binary gives you? None of the error
messages you gave comes from build_binary.

On 11/14/11 14:40, Daniel Schaut wrote:
> Hi Kenneth,
>
> Thanks for your reply.
>
> I checked the iARPA file again, and it is UTF-8. I also deleted the
> first line of the file and tried again, but without success. The
> result is the same:
> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64, 
> <CORPUS>  line 1.
> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 
> 70,<CORPUS> line 1.
>
> I also tried calling build_binary with an ARPA file, but I still get
> the same error as when I run build-lm.sh:
> (4) Training recasing model @ Mon Nov 14 12:49:06 CET 2011
> Can't exec
> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
> No such file or directory at ./train-recaser.perl line 95.
>
> Of course, I cleaned my files beforehand with clean-corpus-n and also
> looked into train-recaser. Additionally, I changed the variable
> $TRAIN_SCRIPT from "train-factored-phrase-model.perl" to
> "train-model.perl" in line 13.
> Line 95 just echoes the error/command (print STDERR '$cmd';). In my
> folder "corpus", I have files called "cased" and "lowercased", and an LM
> called "cased.ilm/arpa", depending on the command I use.
> train-model.perl remains in /scripts-20111024-1127/training. Even if I
> move train-model.perl into /scripts-20111024-1127/recaser, the error at
> line 95 persists.
>
> What am I missing? Which other line or switch do I have to change?
>
> Best,
> Daniel
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Kenneth Heafield
> Sent: Saturday, 12 November 2011 18:31
> To: [email protected]
> Subject: Re: [Moses-support] Train recasing model using IRSTLM
>
> Hi,
>
>       It looks like your training data isn't valid UTF-8.  Either convert
> it to UTF-8 with iconv or scrub the invalid data first.
>
> Kenneth
>
> On 11/12/11 15:58, Daniel Schaut wrote:
>> Dear all,
>>
>>
>>
>> I’m having some difficulty training the recasing model with IRSTLM.
>> I changed the train-recaser script according to
>>
>> http://www.mail-archive.com/[email protected]/msg01934.html
>>
>> but this results in an error which I don’t know how to fix.
>>
>>
>>
>> Error log:
>>
>> -----------------------------------------------------------------------
>>
>> (4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
>>
>> /home/user/mosestools/scripts-20111024-1127/training/train-model.perl
>> --root-dir /home/user/moses/work/recaser --model-dir 
>> /home/user/moses/work/recaser --first-step 4 --alignment a --corpus 
>> /home/user/moses/work/recaser/aligned --f lowercased --e cased 
>> --max-phrase-length 1 --lm
>> 0:3:/home/user/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir
>> /home/user/moses/mosestools/scripts-20111024-1127
>>
>> Can't exec
>> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
>> No such file or directory at ./train-recaser.perl line 95.
>>
>>
>>
>> (11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011
>>
>> -----------------------------------------------------------------------
>>
>>
>>
>> Then instead of using build-lm.sh, I gave it another try calling 
>> compile-lm directly:
>>
>> my $cmd = "/home/user/moses/mosestools/irstlm-5.60.03/bin/compile-lm
>> $CORPUS /dev/stdout | gzip -c > $DIR/cased.irstlm.gz";
>>
>> where $CORPUS is a gzipped iARPA file.
>>
>>
>>
>> Error log:
>>
>> -----------------------------------------------------------------------
>>
>> (3) Preparing data for training recasing model @ Sat Nov 12 15:11:26 CET 2011
>>
>> /home/nexoc/moses/work/recaser/aligned.lowercased
>>
>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64, 
>> <CORPUS>  line 1.
>>
>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70, 
>> <CORPUS>  line 1.
>>
>> -----------------------------------------------------------------------
>>
>>
>>
>> Please see full error logs attached for more information.
>>
>>
>>
>> Could anyone give me a hint on how to train a recasing model with 
>> either build-lm.sh or compile-lm? Help is very much appreciated.
>>
>>
>>
>> Thanks,
>>
>> Daniel
>>
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support


