Re: [Moses-support] train-model.perl with mgizapp fails when extended UTF-8 characters in output path.

Kenneth Heafield Thu, 27 Dec 2012 15:41:56 -0800

+Qin Gao

Is this a train-model.perl problem or an mgiza problem?


On 12/27/12 23:04, Hieu Hoang wrote:
> Hi Tom,
>
> We don't really keep documentation on dependencies. We just try not to
> add dependencies until they're really needed. I only know the usual suspects:
> boost
> perl
> python
> gcc
> And lots of optional libraries, e.g. irstlm, srilm, tcmalloc...
>
> I don't know the exact versions of each. They're subject to change anyway,
> depending on added functionality and how much people complain.
>
> You're probably in a better position to know about the exact
> dependencies since you have customers bending your ear about them.
>
>
> On 25/12/2012 01:52, Tom Hoar wrote:
>>
>> Merry Christmas everyone.
>>
>> Thanks, Hieu. No, your suggestion is not a problem. Documenting the
>> limitation and trapping the front-end is a viable solution.
>>
>> We found the problem when a customer reported our code improperly
>> handled ASCII vs UTF-8 with European accented characters. I told the
>> staff to test our fixes with a worst-case scenario. They chose Thai
>> paths. Nice, huh? Since then, we fell back to "easier" European
>> accented characters, Chinese and Japanese. All of the non-Thai
>> characters seem to work fine. We can only replicate the error with
>> Thai. So, this seems to be a bug in Perl and its handling of Thai
>> characters with the system() call.
>>
>> This troubleshooting exercise reveals some additional challenges that
>> we shared with our MS Windows team. Right now, that team is
>> documenting the dependencies in train-model.perl. Can you or your team
>> share any documentation of the dependencies?
>>
>> Thanks,
>> Tom
>>
>> On 2012-12-25 06:23, Hieu Hoang wrote:
>>
>>> hi tom
>>>
>>> in an ideal world, non-ASCII characters (and spaces and misc other
>>> characters) wouldn't be a problem. Unfortunately, the scripts aren't
>>> tested very often for those cases, and it's too difficult to make every
>>> script work for anything but ASCII paths, especially as the code is
>>> spread over Moses and Mgiza scripts.
>>>
>>> you're probably better off constraining your user front-end likewise.
>>> Is that a problem for you?
>>>
>>> merry xmas
>>> hieu
>>>
>>> On 24/12/2012 09:44, Tom Hoar wrote:
>>>>
>>>> I've traced a problem in train-model.perl but don't know how to fix
>>>> it. I'm using Moses 0.91 and the error occurs when calling
>>>> merge_alignment.py.
>>>>
>>>> Line 1988, system(@_);, fails when the output path contains some
>>>> extended (Thai) UTF-8 characters.
>>>>
>>>> The log output shows:
>>>>
>>>> Executing: /home/tahoar/bin/merge_alignment.py
>>>> /home/tahoar/share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ
>>>> -ไทย/giza.อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final.part*> /home/tahoar
>>>> /share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ-ไทย/giza.
>>>> อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final
>>>> sh: cannot create /home/tahoar/share/domy/TRAININGS/alignments
>>>> /align-ไมโคร_tm-อัง ��ฤษ-ไทย/giza.อัง��ฤษ-ไทย/อัง��ฤษ-ไทย.A3.final:
>>>> Directory nonexistent
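
One way to narrow this down is to list the UTF-8 bytes of each character in a path component and compare them with a hex dump of what the shell actually received. In the log above, the character that comes out damaged appears to be ก. The sketch below is only a diagnostic aid under that reading of the log; the function name utf8_breakdown is hypothetical and not part of Moses or mgiza.

```python
def utf8_breakdown(text):
    """Return (character, code point, UTF-8 bytes as hex) for each character."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        hex_bytes = " ".join("%02x" % b for b in encoded)
        rows.append((ch, "U+%04X" % ord(ch), hex_bytes))
    return rows


# One of the path components from the log above.
component = "อังกฤษ"
for _, codepoint, hex_bytes in utf8_breakdown(component):
    # Print only ASCII so the output survives any terminal encoding.
    print(codepoint, hex_bytes)
```

Comparing this listing with the bytes in the failing command line should show which byte sequence is altered before the shell sees it.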
>>>>
>>>> Contrary to the log error message, the correct output directory
>>>> exists. Three things to note:
>>>>
>>>> 1) The corrupted UTF-8 characters above are in the log echoed to the
>>>> terminal; they're not an artifact of a bad email
>>>>
>>>> 2) I can run the "Executing: xxx" line from the terminal and it
>>>> works fine
>>>>
>>>> 3) I patched merge_alignment.py to save the sys.argv list to a text
>>>> file just after the test for command arguments. The file never gets
>>>> created. So, merge_alignment.py is never executed with the Perl
>>>> "system" call.
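
The check described in point 3 can be sketched roughly as below. This is a minimal Python 3 illustration, not the actual patch; dump_argv and the log file name are hypothetical. It writes each argument with ascii() so that any mangled or undecodable bytes show up as visible escapes instead of being lost.

```python
import sys


def dump_argv(log_path, argv=None):
    """Write each command-line argument to log_path, one per line.

    ascii() escapes every non-ASCII character, so the log itself stays
    plain ASCII and damaged bytes remain visible.
    """
    if argv is None:
        argv = sys.argv
    with open(log_path, "w", encoding="ascii") as handle:
        for index, arg in enumerate(argv):
            handle.write("%d\t%s\n" % (index, ascii(arg)))


# Hypothetical use at the top of the script under test:
# dump_argv("/tmp/merge_alignment.argv.txt")
```

If the file never appears, as reported above, the script was never started, which points at the calling side and not at the script.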
>>>>
>>>> I attached two proposed changes that I used to resolve the problem.
>>>> I updated merge_alignment.py so the first argument is the output
>>>> file name and all remaining arguments are input files. The new
>>>> merge_alignment.py uses glob to support wildcards in the input file
>>>> names, and it sends output to the file instead of stdout. The second
>>>> change is train-model.perl to match the command line changes to
>>>> merge_alignment.py.
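
The command-line shape described here (output file first, input files or wildcards after it, output written to a file and not to stdout) can be sketched as follows. This is not the attached patch: the function names are hypothetical, and the merge step is replaced by plain concatenation as a placeholder, since the real merging logic of merge_alignment.py is not shown in this thread.

```python
import glob
import sys


def expand_inputs(patterns):
    """Expand wildcards ourselves so no shell is needed for globbing."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the literal name if nothing matches, so a missing input
        # fails loudly when it is opened.
        paths.extend(matches if matches else [pattern])
    return paths


def merge_to_file(output_path, patterns):
    """Placeholder merge: concatenate all inputs into output_path."""
    inputs = expand_inputs(patterns)
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as part:
                out.write(part.read())
    return inputs


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage (assumed): script.py OUTPUT INPUT_OR_PATTERN [...]
    merge_to_file(sys.argv[1], sys.argv[2:])
```

With this shape the caller no longer needs "*" expansion or ">" redirection, so the command can be passed as a plain argument list.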
>>>>
>>>> Unfortunately, this only fixes the system call to
>>>> merge_alignment.py. There are many other system calls that redirect
>>>> the output, and each of them shows the same problem of corrupting
>>>> the UTF-8 output path.
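
A general way to sidestep the redirection problem is to keep the output path out of the shell command line entirely: the caller opens the output file itself and starts the program from an argument list, with no shell involved. The sketch below shows that idea in Python 3 and is an illustration only. train-model.perl is Perl, so an equivalent change there would have to be made in Perl, and run_redirected is a hypothetical name.

```python
import subprocess


def run_redirected(argv, output_path):
    """Run argv without a shell and send its stdout to output_path.

    The output path is opened by this process, so it is never parsed
    or re-encoded as part of a shell command string.
    """
    with open(output_path, "wb") as out:
        completed = subprocess.run(argv, stdout=out, check=False)
    return completed.returncode
```

Because no shell is involved, wildcards in argv are not expanded; the caller would need to expand them first, for example with glob.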
>>>>
>>>> Any suggestions?
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
