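Aha. This is mildly amusing. gzip's magic bytes are 0x1f 0x8b, and "\x8B" is
exactly the byte the error complains about. That Perl script is not prepared
to accept gzipped files, so it tried to decode the compressed data as UTF-8.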
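
For anyone who wants to check this on their own files, here is a minimal
Python sketch. It is not part of train-recaser.perl or Moses, and the names
looks_gzipped and open_corpus are made up for illustration. It reads the first
two bytes of a file and, if they match the gzip magic, opens the file through
gzip instead of as plain text.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_corpus(path):
    """Open a corpus as UTF-8 text, decompressing on the fly if it is gzipped.

    Hypothetical helper: it only illustrates the check, it does not
    reproduce what train-recaser.perl does.
    """
    if looks_gzipped(path):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A plain-text reader that skips this check sees 0x1f first, which is a valid
(control) character in UTF-8, and then fails on 0x8b, which cannot start a
UTF-8 sequence. That would explain why the error names "\x8B" rather than the
first byte.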

Kenneth

On 11/15/11 10:24, Daniel Schaut wrote:
> Hi Kenneth,
>
> I ran iconv on my raw file and on the iARPA/ARPA files; the encoding is OK
> and it did not print any errors. build_binary did not report any errors
> either. But I have finally found the issue causing the script to stop at
> line 95.
>
> In addition to the suggested changes from
> http://www.mail-archive.com/[email protected]/msg01934.html,
>
> one needs to change line 13 from
> my $TRAIN_SCRIPT = "train-factored-phrase-model.perl";
> to
> my $TRAIN_SCRIPT = "/my/path/to/train-model.perl";
>
> To conclude, using build_binary or build-lm.sh worked out fine.
> However, if one would like to use compile-lm instead of build-lm.sh and pass
> it a gzipped iARPA file, the train-recaser script still stops at line 64/70
> due to UTF8 issues. I'll ask the IRSTLM guys.
>
> Thanks for your help! :)
> Daniel
>
> -----Original Message-----
> From: Kenneth Heafield [mailto:[email protected]]
> Sent: Monday, 14 November 2011 16:05
> To: Daniel Schaut
> Subject: Re: AW: [Moses-support] Train recasing model using IRSTLM
>
> You can test if a file is UTF-8 using this command:
>
> iconv -f utf8 -t utf8 <file_name >/dev/null
>
> Does this succeed on your corpus, namely the file you're passing with
> --corpus? Or does it print an error?
>
> What's the error message that build_binary gives you? None of the error
> messages you gave comes from build_binary.
>
> On 11/14/11 14:40, Daniel Schaut wrote:
>> Hi Kenneth,
>>
>> Thanks for your reply.
>>
>> I'm afraid I checked the iARPA file again, and it's UTF8. Furthermore, I
>> deleted the first line of the file and tried again, but without
>> success; same result:
>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
>> <CORPUS> line 1.
>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
>> <CORPUS> line 1.
>>
>> Further, I tried to call build_binary with an ARPA file, but I still
>> get the same error as when I run build-lm.sh:
>> (4) Training recasing model @ Mon Nov 14 12:49:06 CET 2011
>> Can't exec "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
>> No such file or directory at ./train-recaser.perl line 95.
>>
>> Of course, I cleaned my files beforehand with clean-corpus-n and also
>> looked into train-recaser. Additionally, I changed the variable
>> $TRAIN_SCRIPT from "train-factored-phrase-model.perl" to
>> "train-model.perl" in line 13.
>> Line 95 just echoes the error/command (print STDERR '$cmd';). In my
>> folder "corpus", I've got files called "cased", "lowercased" and an LM
>> called "cased.ilm/arpa" depending on the command I use.
>> train-model.perl remains in /scripts-20111024-1127/training. Even if I
>> move train-model.perl into /scripts-20111024-1127/recaser, the error at
>> line 95 persists.
>> What did I miss? Which line or switch do I have to change, too?
>>
>> Best,
>> Daniel
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Kenneth Heafield
>> Sent: Saturday, 12 November 2011 18:31
>> To: [email protected]
>> Subject: Re: [Moses-support] Train recasing model using IRSTLM
>>
>> Hi,
>>
>>      It looks like your training data isn't valid UTF8.  Either convert it
>> to UTF8 with iconv or scrub the invalid data first.
>>
>> Kenneth
>>
>> On 11/12/11 15:58, Daniel Schaut wrote:
>>> Dear all,
>>>
>>>
>>>
>>> I’m having some difficulty training the recasing model with IRSTLM.
>>> I changed the train-recaser script according to
>>>
>>> http://www.mail-archive.com/[email protected]/msg01934.html
>>>
>>> but this results in an error which I don’t know how to fix.
>>>
>>>
>>>
>>> Error log:
>>>
>>> -----------------------------------------------------------------------
>>>
>>> (4) Training recasing model @ Sat Nov 12 14:49:06 CET 2011
>>>
>>> /home/user/mosestools/scripts-20111024-1127/training/train-model.perl
>>> --root-dir /home/user/moses/work/recaser --model-dir
>>> /home/user/moses/work/recaser --first-step 4 --alignment a --corpus
>>> /home/user/moses/work/recaser/aligned --f lowercased --e cased
>>> --max-phrase-length 1 --lm
>>> 0:3:/home/user/moses/work/recaser/cased.irstlm.gz:1 -scripts-root-dir
>>> /home/user/moses/mosestools/scripts-20111024-1127
>>>
>>> Can't exec
>>> "/home/user/mosestools/scripts-20111024-1127/training/train-model.perl":
>>> No such file or directory at ./train-recaser.perl line 95.
>>>
>>>
>>>
>>> (11) Cleaning up @ Sat Nov 12 14:49:06 CET 2011
>>>
>>> -----------------------------------------------------------------------
>>>
>>>
>>>
>>> Then instead of using build-lm.sh, I gave it another try calling
>>> compile-lm directly:
>>>
>>> my $cmd = "/home/user/moses/mosestools/irstlm-5.60.03/bin/compile-lm
>>> $CORPUS /dev/stdout | gzip -c > $DIR/cased.irstlm.gz";
>>>
>>> where $CORPUS is a gzipped iARPA file.
>>>
>>>
>>>
>>> Error log:
>>>
>>> -----------------------------------------------------------------------
>>>
>>> (3) Preparing data for training recasing model @ Sat Nov 12 15:11:26 CET 2011
>>>
>>> /home/nexoc/moses/work/recaser/aligned.lowercased
>>>
>>> utf8 "\x8B" does not map to Unicode at ./train-recaser.perl line 64,
>>> <CORPUS>   line 1.
>>>
>>> Malformed UTF-8 character (fatal) at ./train-recaser.perl line 70,
>>> <CORPUS>   line 1.
>>>
>>> -----------------------------------------------------------------------
>>>
>>>
>>>
>>> Please see full error logs attached for more information.
>>>
>>>
>>>
>>> Could anyone give me a hint on how to train a recasing model with
>>> either build-lm.sh or compile-lm? Help is very much appreciated.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Daniel
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>

