Merry Christmas everyone. 

Thanks, Hieu. No, your suggestion is not a problem. Documenting the
limitation and trapping non-ASCII paths in the front-end is a viable
solution.
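
A minimal sketch of what such a front-end check might look like, assuming a Python front-end. The whitelist and the function names (SAFE_CHARS, unsafe_characters, is_safe_path) are hypothetical and would need adjusting to whatever the scripts actually tolerate:

```python
import string

# Hypothetical whitelist: ASCII letters, digits and a few punctuation marks.
# Spaces are left out because they were also named as a risk.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "/._-")


def unsafe_characters(path):
    """Return the distinct characters outside SAFE_CHARS, in order of first appearance."""
    seen = []
    for ch in path:
        if ch not in SAFE_CHARS and ch not in seen:
            seen.append(ch)
    return seen


def is_safe_path(path):
    """True for a non-empty path made only of whitelisted characters."""
    return bool(path) and not unsafe_characters(path)
```

Returning the offending characters, rather than only a yes/no answer, lets the front-end tell the user exactly what to rename.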

We found the problem when a customer reported that our code improperly
handled ASCII vs. UTF-8 with European accented characters. I told the
staff to test our fixes with a worst-case scenario. They chose Thai
paths. Nice, huh? Since then, we have fallen back to "easier" European
accented characters, Chinese and Japanese. All of the non-Thai
characters seem to work fine. We can only replicate the error with Thai.
So, this seems to be a bug in how Perl handles Thai characters in the
system() call.
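
One way to separate the shell from Perl when comparing scripts is to drive the same kind of redirect through sh directly. The sketch below is only an assumption about how such a check could be written (the helper name redirect_works is hypothetical); it exercises sh, not Perl's system(), so it can rule the shell in or out but cannot confirm the Perl hypothesis:

```python
import os
import shlex
import subprocess
import tempfile


def redirect_works(dir_name):
    """Create dir_name under a temp directory and write into it via `sh -c '... > path'`.

    Returns True when the shell exits cleanly and the file exists afterwards.
    """
    with tempfile.TemporaryDirectory() as base:
        out_dir = os.path.join(base, dir_name)
        os.makedirs(out_dir)
        target = os.path.join(out_dir, "out.txt")
        # shlex.quote keeps the path intact as a single shell word.
        result = subprocess.run(["sh", "-c", "echo ok > " + shlex.quote(target)])
        return result.returncode == 0 and os.path.isfile(target)
```

Calling redirect_works() once per test directory name (accented European, Chinese, Japanese, Thai) gives a quick table of which scripts survive the shell redirect on a given machine.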

This troubleshooting exercise revealed some additional challenges that
we shared with our MS Windows team. Right now, that team is documenting
the dependencies in train-model.perl. Can you or your team share any
documentation of the dependencies?

Thanks,
Tom 

On 2012-12-25 06:23, Hieu Hoang wrote: 

> hi tom
> 
> in an ideal world, non-ascii characters (and spaces and misc other
> characters) wouldn't be a problem. Unfortunately, the scripts aren't
> tested very often for those cases, and it's too difficult to ensure
> that they work for anything but ascii paths, especially as the code
> is spread over the Moses and Mgiza scripts.
> 
> you're probably better off constraining your user front-end likewise.
> Is that a problem for you?
>
> merry xmas
> hieu
> 
> On 24/12/2012 09:44, Tom Hoar wrote: 
> 
>> I've traced a problem in train-model.perl but don't know how to fix
>> it. I'm using Moses 0.91 and the error occurs when calling
>> merge_alignment.py.
>> 
>> Line 1988, system(@_);, fails when the output path contains some
>> extended (Thai) UTF-8 characters.
>>
>> The log output shows:
>> 
>> Executing: /home/tahoar/bin/merge_alignment.py /home/tahoar/share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ-ไทย/giza.อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final.part*>/home/tahoar/share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ-ไทย/giza.อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final
>> sh: cannot create /home/tahoar/share/domy/TRAININGS/alignments/align-ไมโคร_tm-อัง��ฤษ-ไทย/giza.อัง��ฤษ-ไทย/อัง��ฤษ-ไทย.A3.final: Directory nonexistent
>>
>> Contrary to the log error message, the correct output directory
>> exists. Three things to note:
>>
>> 1) The corrupted UTF-8 characters above are in the log echoed to the
>> terminal; they're not an artifact of this email.
>>
>> 2) I can run the "Executing: xxx" line from the terminal and it works
>> fine.
>>
>> 3) I patched merge_alignment.py to save the sys.argv list to a text
>> file just after the test for command arguments. The file never gets
>> created. So, merge_alignment.py is never executed by the Perl
>> "system" call.
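
The diagnostic in item 3 can be sketched roughly as follows. This is not the actual patch; it assumes Python 3, and the helper name dump_argv and the log file location are hypothetical:

```python
import sys


def dump_argv(log_path, argv=None):
    """Write each command-line argument on its own line, using repr() so odd bytes stay visible."""
    if argv is None:
        argv = sys.argv
    with open(log_path, "w", encoding="utf-8") as handle:
        for index, arg in enumerate(argv):
            handle.write("%d\t%r\n" % (index, arg))
```

Placed right after the argument-count test, a call such as dump_argv("/tmp/merge_alignment.argv") leaves a file behind whenever the script starts. If no file appears, the script was never launched at all, which is the conclusion drawn above.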
>> 
>> I attached two proposed changes that I used to resolve the problem.
>> I updated merge_alignment.py so the first argument is the output file
>> name and all remaining arguments are input files. The new
>> merge_alignment.py uses glob to support wildcards in the input file
>> names, and it sends output to the file instead of stdout. The second
>> change updates train-model.perl to match the command-line changes to
>> merge_alignment.py.
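
The attached patch is not reproduced here. As a rough sketch of the command-line handling it describes (output file first, wildcards expanded in-process, output written to a file rather than stdout), something like the following could work. The function names are hypothetical, and plain concatenation stands in for the real alignment-merging logic:

```python
import glob
import sys


def expand_inputs(patterns):
    """Expand each wildcard pattern in-process; matches are sorted per pattern, duplicates dropped."""
    paths = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if path not in paths:
                paths.append(path)
    return paths


def main(argv):
    """argv[1] is the output file; argv[2:] are input files or wildcard patterns."""
    if len(argv) < 3:
        sys.stderr.write("usage: %s OUTPUT INPUT_PATTERN [INPUT_PATTERN ...]\n" % argv[0])
        return 1
    output_path, patterns = argv[1], argv[2:]
    inputs = expand_inputs(patterns)
    if not inputs:
        sys.stderr.write("no input files matched\n")
        return 1
    # Placeholder merge: concatenation stands in for the real alignment merge.
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as part:
                out.write(part.read())
    return 0
```

Because the script expands the wildcard and opens the output file itself, the caller no longer needs the shell for either the "*" or the ">" in the command line.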
>> 
>> Unfortunately, this only fixes the system call to
>> merge_alignment.py. There are many other system calls that redirect
>> their output, and each of them shows the same problem of corrupting
>> the UTF-8 output path.
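
The general idea behind the workaround, having the calling process open the output file instead of asking sh to do it with ">", can be illustrated in Python. This is only a sketch of the technique (run_to_file is a hypothetical helper); whether the equivalent change would avoid the corruption inside train-model.perl is untested:

```python
import subprocess


def run_to_file(args, output_path):
    """Run args (a list, no shell involved) and send its stdout to output_path; return the exit code."""
    with open(output_path, "wb") as out:
        return subprocess.call(args, stdout=out)
```

Since the command is passed as a list and the file is opened by the caller, the output path never appears inside a shell command string.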
>> 
>> Any suggestions?
>>
>> Tom
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support [1]



Links:
------
[1]
http://mailman.mit.edu/mailman/listinfo/moses-support