Re: [Moses-support] sgm generation for personalized test sets

Tom Hoar Sun, 13 Sep 2015 21:00:07 -0700

Thanks Vincent,

Good catch about Python's Unicode processing. This script uses Python's 
`codecs` library, which treats characters according to their Unicode 
definitions. So the call to splitlines() splits the decoded string into a 
list at the traditional ASCII CR/LF sequences, as expected. In addition, 
however, it also splits on three Unicode characters. They are:


     \u2028 or \xe2\x80\xa8  - line separator; LSEP
     \u2029 or \xe2\x80\xa9  - paragraph separator; PSEP
     \u2063 or \xe2\x81\xa3  - invisible separator; ISEP
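
A quick way to check this on your own interpreter is sketched below. It is only an illustration, assuming Python 3; `samples` and `count_lines` are names made up for this example and are not part of makemteval.py. It prints the counts rather than assuming them, because behaviour can differ by character: as far as I know, Python 3's `str.splitlines()` treats U+2028 and U+2029 as line boundaries but not U+2063, so that one is worth verifying.

```python
# Minimal check of where str.splitlines() breaks a decoded (Unicode) string.
# `samples` and `count_lines` are illustrative names, not part of makemteval.py.
samples = {
    "CRLF": "a\r\nb",
    "LSEP U+2028": "a\u2028b",
    "PSEP U+2029": "a\u2029b",
    "ISEP U+2063": "a\u2063b",
}


def count_lines(text):
    """Return how many pieces str.splitlines() produces for text."""
    return len(text.splitlines())


for name, text in samples.items():
    # Print the count and the pieces so the behaviour can be inspected directly.
    print(name, count_lines(text), text.splitlines())

# The UTF-8 byte sequences listed above can be confirmed the same way.
for char in ("\u2028", "\u2029", "\u2063"):
    print(repr(char), char.encode("utf-8"))
```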

We discovered this after contributing this script to Moses. In our 
experience, Asian-language text editors create these characters more 
often, and European editors typically don't. This means you can end 
up with a line-count mismatch between the two languages.

Do you think we should update this script, or should users be 
responsible for how they handle these cases?
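
If we did update it, one possible direction is sketched below. This is a rough idea under stated assumptions, not the script's current behaviour: `split_on_newlines_only` is a hypothetical helper, it assumes Python 3 and already-decoded text, and it assumes a stray CR or a Unicode separator inside a line should never start a new segment.

```python
import re

# Hypothetical helper, not part of makemteval.py.
# Unicode separators that str.splitlines() would otherwise treat as line breaks.
_UNICODE_SEPARATORS = re.compile("[\u2028\u2029]")


def split_on_newlines_only(text):
    """Split decoded text on LF only, so one input line yields one segment."""
    if not text:
        return []
    # Treat CRLF as a single line ending.
    text = text.replace("\r\n", "\n")
    # Assumption: a lone CR is noise inside a line, so drop it.
    text = text.replace("\r", "")
    # Assumption: separators inside a line are replaced by a plain space.
    text = _UNICODE_SEPARATORS.sub(" ", text)
    # Drop one trailing newline so the last line is not followed by an empty entry.
    if text.endswith("\n"):
        text = text[:-1]
    return text.split("\n")
```

Used on both the source and target files, this should keep the segment counts aligned with what `wc -l` reports, at the cost of silently rewriting those characters. Whether that is acceptable is exactly the question above.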



On 9/13/2015 11:01 PM, [email protected] wrote:
> Date: Sun, 13 Sep 2015 10:44:02 +0200
> From: Vincent Nguyen<[email protected]>
> Subject: Re: [Moses-support] sgm generation for personalized test sets
> To: moses-support<[email protected]>
> Message-ID:<[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
>
> In order to use makemteval.py we need to remove 0D and E2 80 A8 from the
> txt files.
> Python handles them as additional line breaks.
>
> On 12/09/2015 22:07, Vincent Nguyen wrote:
>> >Hi,
>> >
>> >What script do you guys use to generate sgm sets based on txt file ?
>> >
>> >I have tried makemteval.py in contrib
>> >but there are a few issues.
>> >
>> >I think these lines:
>> >lines =
>> >[l.replace('&quot;','\"').replace('&apos;','\'').replace('&gt;','>').replace('&lt;','<').replace('&amp;','&')
>> >for l in filein.read().splitlines()]
>> >filein.close()
>> >lines =
>> >[l.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;').replace('\'','&apos;').replace('\"','&quot;')
>> >for l in lines]
>> >
>> >are not 100% bullet proof.
>> >
>> >in the output I still get &apos; and such
>> >it does not handle the &nbsp;
>> >it does not handle the \r\n sequence I think since the output has more
>> >lines than in the txt file.
>> >
>> >Maybe there is another script.
>> >
>> >thanks.
>> >
>> >
>> >
>> >_______________________________________________
>> >Moses-support mailing list
>> >[email protected]
>> >http://mailman.mit.edu/mailman/listinfo/moses-support

