+Qin Gao

Is this a train-model.perl problem or an mgiza problem?

On 12/27/12 23:04, Hieu Hoang wrote:
> Hi Tom,
>
> We don't really keep documentation on dependencies. We just try not to
> add dependencies until it's really needed. I only know the usual suspects:
> boost
> perl
> python
> gcc
> And lots of optional libraries eg. irstlm, srilm, tcmalloc...
>
> i don't know the exact versions of each. It's subject to change anyway,
> depending on added functionality and how much people complain.
>
> You're probably in a better position to know about the exact
> dependencies since you have customers bending your ear about them.
>
>
> On 25/12/2012 01:52, Tom Hoar wrote:
>>
>> Merry Christmas everyone.
>>
>> Thanks, Hieu. No, your suggestion is not a problem. Documenting the
>> limitation and trapping the front-end is a viable solution.
>>
>> We found the problem when a customer reported our code improperly
>> handled ASCII vs UTF-8 with European accented characters. I told the
>> staff to test our fixes with a worst-case scenario. They chose Thai
>> paths. Nice, huh? Since then, we fell back to "easier" European
>> accented characters, Chinese and Japanese. All of the non-Thai
>> characters seem to work fine. We can only replicate the error with
>> Thai. So, this seems to be a bug in Perl and its handling of Thai
>> characters with the system() call.
>>
>> This troubleshooting exercise reveals some additional challenges that
>> we shared with our MS Windows team. Right now, that team is
>> documenting the dependencies in train-model.perl. Can you or your team
>> share any documentation of the dependencies?
>>
>> Thanks,
>> Tom
>>
>> On 2012-12-25 06:23, Hieu Hoang wrote:
>>
>>> hi tom
>>>
>>> in an ideal world, non-ascii characters (and spaces and misc other
>>> characters) won't be a problem. Unfortunately, the scripts aren't
>>> tested very often for those cases and it's too difficult to enforce
>>> scripts to work for anything but ascii paths. Especially as it's
>>> spread over Moses and Mgiza scripts.
>>>
>>> you're probably better off constraining your user front-end likewise.
>>> Is that a problem for you?
>>>
>>> merry xmas
>>> hieu
>>>
>>> On 24/12/2012 09:44, Tom Hoar wrote:
>>>>
>>>> I've traced a problem in train-model.perl but don't know how to fix
>>>> it. I'm using Moses 0.91 and the error occurs when calling
>>>> merge_alignment.py.
>>>>
>>>> Line 1988, system(@_);, fails when the output path contains some
>>>> extended (Thai) UTF-8 characters.
>>>>
>>>> The log output shows:
>>>>
>>>> Executing: /home/tahoar/bin/merge_alignment.py
>>>> /home/tahoar/share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ
>>>> -ไทย/giza.อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final.part*> /home/tahoar
>>>> /share/domy/TRAININGS/alignments/align-ไมโคร_tm-อังกฤษ-ไทย/giza.
>>>> อังกฤษ-ไทย/อังกฤษ-ไทย.A3.final
>>>> sh: cannot create /home/tahoar/share/domy/TRAININGS/alignments
>>>> /align-ไมโคร_tm-อัง ��ฤษ-ไทย/giza.อัง��ฤษ-ไทย/อัง��ฤษ-ไทย.A3.final:
>>>> Directory nonexistent
>>>>
>>>> Contrary to the log error message, the correct output directory
>>>> exists. Three things to note:
>>>>
>>>> 1) The corrupted UTF-8 characters above are in the log echoed to the
>>>> terminal, they're not a bad email
>>>>
>>>> 2) I can run the "Executing: xxx" line from the terminal and it
>>>> works fine
>>>>
>>>> 3) I patched merge_alignment.py to save the sys.argv list to a text
>>>> file just after the test for command arguments. The file never gets
>>>> created. So, merge_alignment.py is never executed with the Perl
>>>> "system" call.
>>>>
>>>> I attached two proposed changes that I used to resolve the problem.
>>>> I updated merge_alignment.py so the first argument is the output
>>>> file name and all remaining arguments are input files. The new
>>>> merge_alignment.py uses glob to support wildcards in the input file
>>>> names, and it sends output to the file instead of stdout. The second
>>>> change is train-model.perl to match the command line changes to
>>>> merge_alignment.py.
>>>>
>>>> Unfortunately, this only fixes the system call to
>>>> merge_alignment.py. There are many other system calls that redirect
>>>> the output, and each of them shows the same problem of corrupting
>>>> the UTF-8 output path.
>>>>
>>>> Any suggestions?
>>>>
>>>> Tom

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
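For readers following the thread, here is a minimal sketch of the command-line shape Tom describes for his patched merge_alignment.py: the first argument is the output file, the remaining arguments are input patterns expanded with glob, and the result is written directly to the file rather than to stdout, so no shell redirection is involved. The attached patch itself is not reproduced in this thread, so the names `expand_inputs`, `merge_to_file` and `main` are hypothetical, and plain concatenation stands in for the real merge logic.

```python
import glob
import sys


def expand_inputs(patterns):
    """Expand each wildcard pattern into a sorted list of matching paths."""
    paths = []
    for pattern in patterns:
        paths.extend(sorted(glob.glob(pattern)))
    return paths


def merge_to_file(output_path, patterns):
    """Write all matching input files to output_path and return the inputs used.

    Assumption: plain concatenation is a placeholder for the actual merge
    performed by merge_alignment.py, which is not shown in the thread.
    """
    inputs = expand_inputs(patterns)
    # Python opens the output path itself, so the path never passes
    # through a shell redirect.
    with open(output_path, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as part:
                for line in part:
                    out.write(line)
    return inputs


def main(argv):
    """Hypothetical entry point: argv[0] is the output, the rest are patterns."""
    if len(argv) < 2:
        print("usage: OUTPUT INPUT_PATTERN [INPUT_PATTERN ...]", file=sys.stderr)
        return 2
    merge_to_file(argv[0], argv[1:])
    return 0
```

A script built this way would call `main(sys.argv[1:])`. The caller passes the wildcard unexpanded and leaves the expansion to glob, which is what removes the need for the `>` redirect in the calling command.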
