Thanks Vincent,
Good catch about Python's Unicode processing. This script uses Python's
`codecs` library, so the text is handled as Unicode and characters are
treated according to their Unicode definitions. The call to splitlines()
therefore splits the string into a list as expected on traditional ASCII
cr/lf sequences, but it also splits on Unicode separator characters,
including:
\u2028 or \xe2\x80\xa8 - line separator; LSEP
\u2029 or \xe2\x80\xa9 - paragraph separator; PSEP
The invisible separator (\u2063 or \xe2\x81\xa3; ISEP) is not one of the
line boundaries that splitlines() recognizes, so it does not add lines.
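For anyone who wants to check this on their own installation, here is a minimal sketch, assuming Python 3. The helper name `splits_on` and the `CANDIDATES` table are made up for illustration and are not part of makemteval.py. It only reports what splitlines() does with each character, so the list can be verified rather than taken on trust.

```python
# Illustration only: report which characters str.splitlines() treats
# as line boundaries. Names here are hypothetical, not from makemteval.py.

CANDIDATES = {
    "LF": "\n",
    "CRLF": "\r\n",
    "LSEP": "\u2028",   # line separator
    "PSEP": "\u2029",   # paragraph separator
    "ISEP": "\u2063",   # invisible separator
}


def splits_on(sep):
    """Return True if splitlines() breaks the string 'a<sep>b' in two."""
    return len(("a" + sep + "b").splitlines()) == 2


def report():
    """Map each candidate name to whether splitlines() splits on it."""
    return {name: splits_on(ch) for name, ch in CANDIDATES.items()}


if __name__ == "__main__":
    for name, result in sorted(report().items()):
        print(name, result)
```

Any character reported as True will add a line to the output that a plain newline count of the input file does not show.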
We discovered this after contributing this script to Moses. In our
experience, Asian-language text editors create these characters more
often, and European editors typically don't. This means you can end
up with a line count mismatch between the two languages.
Do you think we should update this script, or should users be
responsible for how they handle these cases?
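If we did update the script, one option would be to split on "\n" only and leave the Unicode separators inside the segment. The following is a rough sketch, assuming Python 3 and assuming the rest of the pipeline counts lines by "\n" alone. `split_on_newline_only` is a hypothetical helper, not something that exists in makemteval.py today.

```python
# Sketch of one possible replacement for splitlines() in the script.
# Assumption: downstream tools count lines by "\n" only.


def split_on_newline_only(text):
    """Split on "\n" only; drop a trailing "\r" left over from "\r\n".

    Like splitlines(), a final newline does not produce an extra empty
    element. Unicode separators such as \u2028 stay inside the line.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # text was empty or ended with a newline
    return [line.rstrip("\r") for line in lines]


if __name__ == "__main__":
    sample = "first\u2028still first\r\nsecond\n"
    print(len(sample.splitlines()))            # Unicode-aware count
    print(len(split_on_newline_only(sample)))  # newline-only count
```

Whether the separators should then be kept, replaced with a space, or removed is a separate decision. The sketch leaves them untouched, including any "\r" that appears in the middle of a line.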
On 9/13/2015 11:01 PM, [email protected] wrote:
> Date: Sun, 13 Sep 2015 10:44:02 +0200
> From: Vincent Nguyen<[email protected]>
> Subject: Re: [Moses-support] sgm generation for personalized test sets
> To: moses-support<[email protected]>
> Message-ID:<[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
>
> In order to use makemteval.py, we need to remove 0D and E2 80 A8 from the
> txt files.
> Python handles them as additional line breaks.
>
> On 12/09/2015 22:07, Vincent Nguyen wrote:
>> >Hi,
>> >
>> >What script do you guys use to generate sgm sets from txt files?
>> >
>> >I have tried makemteval.py in contrib
>> >but there are a few issues.
>> >
>> >I think these lines:
>> >lines =
>> >[l.replace('&quot;','\"').replace('&apos;','\'').replace('&gt;','>').replace('&lt;','<').replace('&amp;','&')
>> >for l in filein.read().splitlines()]
>> >filein.close()
>> >lines =
>> >[l.replace('&amp;','&').replace('&lt;','<').replace('&gt;','>').replace('\'','&apos;').replace('\"','&quot;')
>> >for l in lines]
>> >
>> >are not 100% bullet proof.
>> >
>> >in the output I still get &apos; and such
>> >it does not handle the \r\n sequence I think since the output has more
>> >lines than in the txt file.
>> >
>> >Maybe there is another script.
>> >
>> >thanks.
>> >
>> >
>> >
>> >_______________________________________________
>> >Moses-support mailing list
>> >[email protected]
>> >http://mailman.mit.edu/mailman/listinfo/moses-support