On Wednesday, June 25, 2003 4:31 PM, Andrew C. West <[EMAIL PROTECTED]> wrote:
> On Wed, 25 Jun 2003 15:05:26 +0400, "Valeriy E. Ushakov" wrote:
> What I'm suggesting is that although "cui" <0F45, 0F74, 0F72> and
> "ciu" <0F45, 0F72, 0F74> should be rendered identically, the logical
> ordering of the codepoints representing the vowels may represent
> lexical differences that would be lost during the process of
> normalisation. 

This is an excellent argument, and it is why the Vietnamese use of multiple 
diacritics was studied, so that the encoding could preserve the logical ordering of 
accents on Latin letters. However, if the rendered text cannot be distinguished, the 
order of the diacritics matters only in the mind of the reader and does not exist in 
the written form.

This would be important if there were a need to create a transliteration rule (for 
example from Tibetan to Latin script). But even in that case, knowledge of the source 
language is required, as no transliteration rule works well using only the script 
information. So transliteration rules are very often context-sensitive.

What is important is how a native Tibetan reader would read the grapheme cluster. If 
that reader reads it as "ciu", then it is to be interpreted as "ciu", and the logical 
order is more important than the encoding order, because no such difference exists in 
the actual written script.

If I just take the example of the Latin script, a sequence like <C, COMBINING CEDILLA, 
COMBINING ACUTE ACCENT> will have a canonical order for the last two diacritics which 
is not important at the linguistic level when looking at the written script. The 
canonical order and combining classes exist only BECAUSE the encoding would otherwise 
allow several *equivalent* sequences that no reader would be able to tell apart. When 
such confusion is possible, and the distinction does not exist in the original 
script before its encoding, there should be a way to unify all these sequences.
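
To make the Latin example concrete, here is a minimal sketch in Python using the 
standard unicodedata module. The variable names are mine and purely illustrative; 
the combining classes are whatever the Unicode database shipped with Python reports.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT

# Two encodings of the same written letter, differing only in mark order.
cedilla_first = "C" + CEDILLA + ACUTE
acute_first = "C" + ACUTE + CEDILLA

# Canonical combining classes of the two marks (below-attached vs. above).
classes = [unicodedata.combining(ch) for ch in (CEDILLA, ACUTE)]

# The classes differ, so canonical reordering brings both into one order.
nfd_a = unicodedata.normalize("NFD", cedilla_first)
nfd_b = unicodedata.normalize("NFD", acute_first)

print(classes)
print(cedilla_first == acute_first, nfd_a == nfd_b)
```

The raw strings differ, but their normalized forms should compare equal, which is 
exactly the unification of equivalent sequences described above.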

So even if the canonical ordering of Tibetan vowel signs is not logical, as long as it 
produces the same written text, this is not a problem, and there is no more loss of 
semantics than in the original script.

So if the Tibetan script cannot make a distinction between "ciu" and "cui", this is 
*not* a Unicode defect. This confusion already exists in the original script, and 
there is no loss of semantics in the Unicode encoding when compared to the actual 
written script. Let's not create a problem by adding new semantics to the Tibetan 
language (such as a distinction between "ciu" and "cui") just *because* this seems 
/possible/ in Unicode. If we respect a script or language, we must not tolerate such 
artificial distinctions.

It's true that the canonical ordering should match the logical ordering, but I 
think there are many exceptions, notably within Brahmic scripts with disjoint 
letters, or in Thai (encoded according to a pre-existing standard, TIS-620, which 
used visual ordering), or even in many Hebrew or Arabic texts (sometimes also encoded 
in visual order, and requiring tools to reverse the encoding into a preferred 
order, because this cannot be decided without an out-of-band specification of the 
actual ordering used in the text)...

So if one really wants to handle the logical ordering, it's perfectly possible to 
exchange the "i" and "u" in "cui" without affecting the canonical equivalence and 
without changing the semantics of the original Tibetan text. Canonical ordering is only 
needed to unify equivalent sequences; it is not intended to sort distinct strings 
(sorting is not part of the Unicode encoding, but part of a collation algorithm like UCA, 
tailored appropriately for each language on top of the default UCA order for the script).
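
As a quick check of the claim about exchanging "i" and "u", here is a small Python 
sketch using the standard unicodedata module. The helper canonically_equivalent() 
is my own name for illustration, not a library function.

```python
import unicodedata

CA = "\u0F45"      # TIBETAN LETTER CA
SIGN_I = "\u0F72"  # TIBETAN VOWEL SIGN I
SIGN_U = "\u0F74"  # TIBETAN VOWEL SIGN U

cui = CA + SIGN_U + SIGN_I  # <0F45, 0F74, 0F72>
ciu = CA + SIGN_I + SIGN_U  # <0F45, 0F72, 0F74>

def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two vowel signs have different non-zero combining classes,
# so canonical reordering may swap them.
print([unicodedata.combining(ch) for ch in (SIGN_I, SIGN_U)])
print(cui == ciu, canonically_equivalent(cui, ciu))
```

The two code point sequences are distinct as raw strings, yet they should compare 
equal once normalized, so swapping the vowel signs stays within one equivalence class.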

A correct UCA collation for the Tibetan script can perfectly well be created, and then 
tailored for the Tibetan language to reorder the vowel signs. (This is no more 
complicated than handling the French reordering of accents.) This just requires a 
multi-level sort algorithm, where "u" and "i" would have the same collation keys at 
level N, and could be reordered using a French-style reordering of vowel signs for 
keywords or grapheme clusters at level N+1 or N+2.
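
A very rough sketch of such a multi-level key, in Python. This is only a toy under 
my own assumptions and not the UCA: the weights in VOWEL_WEIGHT and the function 
toy_sort_key() are invented for illustration, and a real tailoring would be written 
against the UCA collation element table.

```python
import unicodedata

# Hypothetical weights, not taken from the UCA or any real tailoring:
# here the tailoring decides that "u" sorts before "i".
VOWEL_WEIGHT = {"\u0F74": 1, "\u0F72": 2}

def toy_sort_key(text):
    nfd = unicodedata.normalize("NFD", text)
    # Level N: every vowel sign gets the same placeholder weight,
    # so "i" and "u" are not distinguished here.
    level_n = tuple("V" if ch in VOWEL_WEIGHT else ch for ch in nfd)
    # Level N+1: vowel-specific weights, compared from the end of the
    # string, in the manner of French accent ordering.
    level_n1 = tuple(VOWEL_WEIGHT.get(ch, 0) for ch in reversed(nfd))
    return (level_n, level_n1)

CA = "\u0F45"  # TIBETAN LETTER CA
words = [CA + "\u0F72", CA + "\u0F74", CA + "\u0F74\u0F72", CA + "\u0F72\u0F74"]
ordered = sorted(words, key=toy_sort_key)
print([[hex(ord(ch)) for ch in w] for w in ordered])
```

Because the key is computed on the normalized string, the two canonically equivalent 
spellings of the two-vowel cluster get identical keys, while the single-vowel forms 
tie at level N and are separated only at level N+1.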

