Japanese is another language that suffers under standard Unicode NFKC,
because the normalization applies changes that cannot be reversed.
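
To make the irreversibility concrete, here is a minimal Python sketch. It is an illustration only, not part of tokenizer.perl, and uses the standard unicodedata module. It assumes half-width katakana and full-width Latin letters as representative cases from Japanese text; the helper name collides_under_nfkc is hypothetical. For comparison it also applies NFC, which performs only canonical composition, to the decomposed "Jürgen" from this thread.

```python
import unicodedata


def nfkc(text):
    """Apply NFKC: compatibility decomposition followed by composition."""
    return unicodedata.normalize("NFKC", text)


def nfc(text):
    """Apply NFC: canonical decomposition followed by composition."""
    return unicodedata.normalize("NFC", text)


def collides_under_nfkc(a, b):
    """True if two different strings become identical after NFKC.

    When this holds, the original form cannot be recovered from the
    normalized text, which is what "cannot be reversed" means here.
    """
    return a != b and nfkc(a) == nfkc(b)


# Illustrative inputs (assumed examples, not taken from the thread).
HALFWIDTH_GA = "\uFF76\uFF9E"  # half-width KA + half-width voiced sound mark
KATAKANA_GA = "\u30AC"         # ordinary katakana GA
FULLWIDTH_A = "\uFF21"         # full-width Latin capital A
ASCII_A = "A"

# The input from the thread: "u" followed by COMBINING DIAERESIS.
DECOMPOSED = "Ju\u0308rgen"
PRECOMPOSED = "J\u00FCrgen"

if __name__ == "__main__":
    # NFKC folds the half-width and full-width forms into the ordinary ones.
    print(collides_under_nfkc(HALFWIDTH_GA, KATAKANA_GA))
    print(collides_under_nfkc(FULLWIDTH_A, ASCII_A))
    # NFC leaves the half-width form alone but still recomposes "Jürgen".
    print(nfc(HALFWIDTH_GA) == HALFWIDTH_GA)
    print(nfc(DECOMPOSED) == PRECOMPOSED)
```

If the goal is only to make the "Jürgen" input tokenize consistently, NFC at the start of the tokenizer would be enough for that particular example, and it avoids the compatibility folding shown above. It would not help with base+combining sequences that have no precomposed form.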



On 12/30/2014 04:40 AM, John D Burger wrote:
>> This is also a reason to turn Unicode normalization on.  If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular 
> example and a few others like it. There are many base+combining grapheme 
> clusters in Unicode text which cannot be normalized to a single pre-composed 
> character. Vietnamese comes to mind.
>
> - JB
>
> On Dec 29, 2014, at 16:05 , Kenneth Heafield <[email protected]> wrote:
>
>> Dear Moses,
>>
>>      The attached file, taken from line 2345157 of
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
>> , tokenizes differently on different machines.
>>
>>      I'm running tokenizer.perl from head (481a07dc) with this perl:
>>
>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>> x86_64-linux-thread-multi
>> (with 25 registered patches, see perl -V for more detail)
>>
>> perl -V is attached from newer machines.
>>
>>      The input is "Jürgen" with a specific encoding:
>>
>> uconv -f utf-8 -x any-name jur
>>
>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
>> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>>
>> So the umlaut is encoded as a normal "u" character followed by a
>> combining diaeresis mark.  This encoding is legal, but it differs from
>> the single-character precomposed encoding, \N{LATIN SMALL LETTER U WITH
>> DIAERESIS}.
>>
>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS} as a single character and recognizing it as part of the
>> IsAlnum class.  Tokenizing on these machines outputs
>>
>> Jürgen
>>
>> Newer machines are treating them separately, recognizing \N{COMBINING
>> DIAERESIS} as a separate character that is not part of IsAlnum.  The
>> Moses tokenizer then treats it as something to split off, yielding this
>> tokenization:
>>
>> Ju ̈ rgen
>>
>> I thought it might be locale-related but IsAlnum is supposed to be
>> locale-agnostic.  I couldn't come up with environment variables that
>> made the new machines tokenize as a single word.
>>
>> Maybe this is a perl bug, but the result is that two different machines
>> running the same perl script produce different tokenization :-(.
>>
>> This is also a reason to turn Unicode normalization on.  If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
>>
>> Kenneth
>>
>> <jur.gz><perl_V.txt>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
