Hi Tom,

If this script is intended exactly and only to generate sgm test/dev files from a txt file, then yes, it needs to be amended.

1) Line breaks other than 0A need to be removed prior to the Python execution (byte stream replace).

2) Even though the XML standard is to replace ' by &apos; and so on for the others, I have noticed that none of the test/dev sets include XML codes like &apos;, so what I did was remove the second string replace in your code. However, I added 2 other replaces in the first sequence: => " " and   => " "

3) Even though this is standard for XML, I removed the first 3 lines of the doc (XML, DOCTYPE and MTEVAL) and also the last one (MTEVAL), all of this to stick to the expected file format for test sets.

If you have the chance, you could add 2 options:
- nb = number of lines you want to take from the file
- selection = either the first nb lines or a random selection from the txt file

I am just wondering whether there is another Perl script developed by someone. How were the sets generated to start with?

Cheers,
Vincent

On 14/09/2015 04:57, Tom Hoar wrote:
> Thanks Vincent,
>
> Good catch about Python's Unicode processing. This script uses Python's
> `codecs` library, which treats characters according to their Unicode
> definitions. So, the function fh.splitlines() splits the string into a
> list as expected with traditional ASCII cr/lf sequences. In addition,
> however, it also splits on three Unicode characters. They are:
>
> \u2028 or \xe2\x80\xa8 - line separator; LSEP
> \u2029 or \xe2\x80\xa9 - paragraph separator; PSEP
> \u2063 or \xe2\x81\xa3 - invisible separator; ISEP
>
> We discovered this after contributing this script to Moses. In our
> experience, Asian-language text editors more often create these
> characters, and European editors typically don't. This means you can end
> up with a line count mismatch between the two languages.
>
> Do you think we should update this script, or should users be
> responsible for how they handle these cases?
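
A minimal sketch of the line-count mismatch described above, and of the "split on 0A only" workaround. The helper name `split_on_lf_only` is hypothetical and not part of makemteval.py, and the sketch assumes the text has already been decoded to a Unicode string. The Python documentation for `str.splitlines()` lists U+2028 and U+2029 among its line boundaries but not U+2063, so that one is worth verifying on your own Python version.

```python
def split_on_lf_only(text):
    """Split text on LF (0x0A) only.

    Hypothetical helper, not part of makemteval.py. It removes the two
    characters named in this thread (CR, and U+2028 which is E2 80 A8 in
    UTF-8) before splitting, so they cannot create extra lines.
    """
    cleaned = text.replace("\r", "").replace("\u2028", "")
    lines = cleaned.split("\n")
    # A trailing newline leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# Three LF-terminated lines; the second contains a U+2028 line separator.
sample = "first\r\nsecond part A\u2028second part B\nthird\n"

print(len(sample.splitlines()))       # counts U+2028 as a line boundary
print(len(split_on_lf_only(sample)))  # counts LF-terminated lines only
```

Removing U+2028 outright joins the text on either side of it; replacing it with a space instead may be preferable, depending on the corpus.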
>
> On 9/13/2015 11:01 PM, [email protected] wrote:
>> Date: Sun, 13 Sep 2015 10:44:02 +0200
>> From: Vincent Nguyen<[email protected]>
>> Subject: Re: [Moses-support] sgm generation for personalized test sets
>> To: moses-support<[email protected]>
>> Message-ID:<[email protected]>
>> Content-Type: text/plain; charset=windows-1252; format=flowed
>>
>>
>> In order to use makemteval.py we need to remove 0D and E2 80 A8 from txt
>> files.
>> Python handles them as additional line breaks.
>>
>> On 12/09/2015 22:07, Vincent Nguyen wrote:
>>>> Hi,
>>>>
>>>> What script do you guys use to generate sgm sets based on a txt file?
>>>>
>>>> I have tried makemteval.py in contrib
>>>> but there are a few issues.
>>>>
>>>> I think these lines:
>>>> lines =
>>>> [l.replace('&quot;','\"').replace('&apos;','\'').replace('&gt;','>').replace('&lt;','<').replace('&amp;','&')
>>>> for l in filein.read().splitlines()]
>>>> filein.close()
>>>> lines =
>>>> [l.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;').replace('\'','&apos;').replace('\"','&quot;')
>>>> for l in lines]
>>>>
>>>> are not 100% bulletproof.
>>>>
>>>> In the output I still get &apos; and such
>>>> it does not handle the
>>>> it does not handle the \r\n sequence, I think, since the output has more
>>>> lines than the txt file.
>>>>
>>>> Maybe there is another script.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
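
The unescape-then-escape step quoted in this thread can also be sketched with the Python 3 standard library instead of chained `replace` calls. This is a swapped-in approach, not what makemteval.py does: `normalise_line` and its `escape_output` flag are hypothetical names, and `html.unescape` decodes the full HTML5 entity set, which is broader than the five XML entities handled in the quoted code.

```python
import html
from xml.sax.saxutils import escape


def normalise_line(line, escape_output=False):
    """Decode character references, then optionally re-escape for XML.

    Hypothetical helper, not part of makemteval.py.
    """
    # Decode named and numeric references such as &apos; or &lt; in one pass.
    text = html.unescape(line)
    if escape_output:
        # escape() converts &, < and > itself (ampersand first, so nothing
        # is escaped twice); the quote characters are passed explicitly.
        text = escape(text, {"'": "&apos;", '"': "&quot;"})
    return text


# Literal characters, matching test sets that contain no XML codes.
print(normalise_line("it&apos;s &lt;b&gt; &amp; more"))

# Fully escaped output, for strictly valid XML.
print(normalise_line('say "hi" & go', escape_output=True))
```

Leaving `escape_output` at its default corresponds to dropping the second replace sequence, as described at the top of this message.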
