hi tom

In an ideal world, non-ASCII characters (and spaces and miscellaneous other characters) wouldn't be a problem. Unfortunately, the scripts aren't tested very often for those cases, and it's too difficult to ensure that they work with anything but ASCII paths, especially as the code is spread over the Moses and MGIZA scripts.

You're probably better off constraining your user front-end to ASCII paths as well. Is that a problem for you?

merry xmas
hieu

On 24/12/2012 09:44, Tom Hoar wrote:

I've traced a problem in train-model.perl but don't know how to fix it. I'm using Moses 0.91, and the error occurs when calling merge_alignment.py.

Line 1988, system(@_);, fails when the output path contains some extended (Thai) UTF-8 characters.

The log output shows:

Executing: /home/tahoar/bin/merge_alignment.py /home/tahoar/share/domy/TRAININGS/alignments/align-?????_tm-?????? -???/giza.??????-???/??????-???.A3.final.part*>/home/tahoar /share/domy/TRAININGS/alignments/align-?????_tm-??????-???/giza. ??????-???/??????-???.A3.final
sh: cannot create /home/tahoar/share/domy/TRAININGS/alignments/align-?????_tm-??? ????-???/giza.???????-???/???????-???.A3.final: Directory nonexistent

Contrary to the log error message, the correct output directory exists. Three things to note:

1) The corrupted UTF-8 characters above appear in the log echoed to the terminal; they are not an email artifact

2) I can run the "Executing: xxx" line from the terminal and it works fine

3) I patched merge_alignment.py to save the sys.argv list to a text file just after the test for command arguments. The file never gets created. So, merge_alignment.py is never executed with the Perl "system" call.
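The check in point 3 can be sketched roughly as follows. This is only an illustration, not the actual patch: the helper name dump_argv and the log path are hypothetical, and the sketch assumes Python 3.

```python
import sys


def dump_argv(log_path, argv=None):
    """Write each command-line argument to log_path, one per line.

    repr() is used so that undecodable bytes in an argument show up as
    escapes instead of raising an encoding error while writing the log.
    """
    argv = sys.argv if argv is None else argv
    with open(log_path, "w", encoding="utf-8") as log:
        for index, arg in enumerate(argv):
            log.write("%d\t%r\n" % (index, arg))


# Hypothetical placement: call this right after the argument-count check,
# e.g. dump_argv("/tmp/merge_alignment.argv.txt")
```

If the log file never appears after a training run, the script was never started, which matches the behaviour described in point 3.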

I attached two proposed changes that I used to resolve the problem. I updated merge_alignment.py so the first argument is the output file name and all remaining arguments are input files. The new merge_alignment.py uses glob to support wildcards in the input file names, and it sends output to the file instead of stdout. The second change is train-model.perl to match the command line changes to merge_alignment.py.
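For readers without the attachments, the new command-line shape could look something like the sketch below. It is not the attached code: the function names are hypothetical, Python 3 is assumed, and the merge step is a plain concatenation standing in for the real alignment-merging logic, which is not reproduced here.

```python
import glob
import sys


def expand_inputs(patterns):
    """Expand each wildcard pattern; keep a literal path if nothing matches."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        paths.extend(matches if matches else [pattern])
    return paths


def merge_to_file(output_path, input_patterns):
    """Write the merged inputs to output_path without shell redirection.

    Placeholder merge: the parts are simply concatenated in sorted order.
    """
    inputs = expand_inputs(input_patterns)
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as part:
                out.write(part.read())
    return inputs


def main(argv):
    """argv[1] is the output file; argv[2:] are input files or wildcards."""
    if len(argv) < 3:
        print("usage: merge_alignment.py OUTPUT INPUT [INPUT ...]",
              file=sys.stderr)
        return 1
    merge_to_file(argv[1], argv[2:])
    return 0


# A real script would end with: sys.exit(main(sys.argv))
```

Because the script opens the output file itself and expands the wildcards with glob, the caller no longer needs ">" or "*" to be interpreted by the shell.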

Unfortunately, this only fixes the system call to merge_alignment.py. There are many other system calls that redirect their output, and each of them shows the same problem of corrupting the UTF-8 output path.

Any suggestions?

Tom



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
