UTS 39 says that, optionally, an implementation can:

"Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].

"The criterion can only be applied if the language of the string is known to be Chinese."

What does it mean for the language to "be known to be Chinese"? Is this something that can be determined algorithmically, or does it rely on information about the input text from outside the UCD?

The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese, so in this example we can algorithmically rule out that it's Chinese.
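
To make the Hiragana point concrete, here is a minimal Python sketch of the kind of check I have in mind. The function name is made up, and using the Hiragana block range (U+3040..U+309F) as a stand-in for the Script property is my own simplification, not anything UTS 39 prescribes. Note that it can only rule Chinese out; it cannot establish that a string is Chinese.

```python
def contains_hiragana(text: str) -> bool:
    """Return True if any code point falls in the Hiragana block.

    Assumption: the block range U+3040..U+309F approximates
    Script=Hiragana; a real implementation would consult Scripts.txt.
    """
    return any(0x3040 <= ord(ch) <= 0x309F for ch in text)


# A string containing Hiragana cannot be purely Chinese text.
print(contains_hiragana("漢字のテスト"))
# A Han-only string tells us nothing either way.
print(contains_hiragana("汉字測試"))
```

The second string mixes simplified and traditional characters, yet nothing in it says whether the language is Chinese, which is exactly the case I am asking about.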

And what does Chinese really mean here?
