Japanese is another language that suffers under standard Unicode NFKC, because the normalization makes changes that cannot be reversed.
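To make the point about irreversibility concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `nfkc` and the sample strings are my own choices for illustration, not anything taken from the Moses tokenizer.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (illustrative helper, not Moses code)."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana KA followed by the half-width voiced sound mark.
halfwidth_ga = "\uff76\uff9e"
# The ordinary full-width katakana letter GA.
fullwidth_ga = "\u30ac"

# Both spellings collapse to the same string, so the original width
# cannot be recovered from the normalized text.
print(nfkc(halfwidth_ga) == nfkc(fullwidth_ga))

# A single "square" compatibility character expands to four kanji.
square_corporation = "\u337f"
print(len(square_corporation), len(nfkc(square_corporation)))

# Full-width Latin letters fold to plain ASCII.
print(nfkc("\uff21") == "A")
```

Because distinct inputs map to the same output, no inverse mapping exists; whether that loss matters depends on the downstream task.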
On 12/30/2014 04:40 AM, John D Burger wrote:
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular
> example and a few others like it. There are many base+combining grapheme
> clusters in Unicode text which cannot be normalized to a single pre-composed
> character. Vietnamese comes to mind.
>
> - JB
>
> On Dec 29, 2014, at 16:05 , Kenneth Heafield <[email protected]> wrote:
>
>> Dear Moses,
>>
>> The attached file, taken from line 2345157 of
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
>> , tokenizes differently on different machines.
>>
>> I'm running tokenizer.perl from head (481a07dc) with this perl:
>>
>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>> x86_64-linux-thread-multi
>> (with 25 registered patches, see perl -V for more detail)
>>
>> perl -V is attached from newer machines.
>>
>> The input is "Jürgen" with a specific encoding:
>>
>> uconv -f utf-8 -x any-name jur
>>
>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
>> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>>
>> So the umlaut is encoded as a normal "u" character followed by a
>> combining diaeresis marker. This encoding is legal, but it differs from
>> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
>> DIAERESIS}.
>>
>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS} as a single character and recognizing it as part of the
>> IsAlnum class. Tokenizing on these machines outputs
>>
>> Jürgen
>>
>> Newer machines are treating them separately, recognizing \N{COMBINING
>> DIAERESIS} as a separate character that is not part of IsAlnum. The
>> Moses tokenizer then treats it as something to split off, yielding this
>> tokenization:
>>
>> Ju ̈ rgen
>>
>> I thought it might be locale-related but IsAlnum is supposed to be
>> locale-agnostic. I couldn't come up with environment variables that
>> made the new machines tokenize as a single word.
>>
>> Maybe this is a perl bug, but the result is that two different machines
>> running the same perl script produce different tokenization :-(.
>>
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
>>
>> Kenneth
>>
>> <jur.gz><perl_V.txt>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
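The two encodings discussed in the quoted thread can be reproduced with a short sketch. It uses Python rather than Perl, so `str.isalnum` here only stands in for Perl's IsAlnum class and may not match it in every case. The function name `compose` and the `q` + diaeresis sample are my own illustrative assumptions.

```python
import unicodedata


def compose(text: str) -> str:
    """Apply canonical composition (NFC); illustrative helper, not Moses code."""
    return unicodedata.normalize("NFC", text)


# "Jürgen" spelled with a plain "u" followed by COMBINING DIAERESIS (U+0308).
decomposed = "Ju\u0308rgen"
composed = compose(decomposed)

# Composition merges the pair into U+00FC, so the string gets one code point shorter.
print(len(decomposed), len(composed))

# On its own, the combining mark is not alphanumeric by Python's definition,
# which is the kind of character a tokenizer might split off.
print("\u0308".isalnum(), all(ch.isalnum() for ch in composed))

# Some base+combining pairs have no precomposed form, so they stay as two
# code points after NFC. "q" with a diaeresis is one such pair.
stubborn = compose("q\u0308")
print(len(stubborn))
```

Normalizing up front removes the ambiguity for pairs like the one in "Jürgen", but clusters without a precomposed character still reach the tokenizer as a base letter plus a combining mark.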
