On the subject of UTF-8, I think the Moses tokeniser may be using the
version that is too strict. I've just changed it to this:

> binmode(STDIN, ":encoding(UTF-8)");
> binmode(STDOUT, ":encoding(UTF-8)");

and later on in the same file:

> open(PREFIX, "<:encoding(UTF-8)", "$prefixfile");

See if this helps.

Miles

On 27 June 2010 13:15, Ingrid Falk <[email protected]> wrote:
> Hi Cyrine,
>
> I think this is because tokenizer.perl expects utf-8 input (on STDIN).
>
> This is because of the binmode(STDIN, ':utf8'); line in the tokenizer
> script.
>
> Your input is maybe not utf-8?
>
> Ingrid
>
> On 06/27/2010 01:08 PM, Cyrine NASRI wrote:
>>
>> Hello everyone,
>> I try to run the script for my two tokenizer.perl development file.
>> I'm having a problem when running, but I do not understand why.
>> A message appears:
>>
>> /home/Bureau/moses/moses/scripts/tokenizer$ ./tokenizer.perl -l fr <
>> /home/Bureau/work/test-fr.fr > /home/Bureau/work/input.tok
>> Tokenizer Version 1.0
>> Language: fr
>> WARNING: No known abbreviations for language 'fr', attempting fall-back
>> to English version...
>> utf8 "\xE9" does not map to Unicode at ./tokenizer.perl line 47, <STDIN>
>> line 1.
>> Malformed UTF-8 character (fatal) at ./tokenizer.perl line 67, <STDIN>
>> line 1.
>>
>> Thank you very much.
>>
>> Sincerely
>> Cyrine

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
