Difficulties due to the present combining class values attached to these characters most frequently occur with abbreviations/contractions and/or with cursive scripts. With abbreviations it is common to have two or more vowels on a consonant stack. In cursive or semi-cursive forms of Tibetan script the subjoined vowels U+0F71, U+0F74 and U+0F75 form ligatures with the consonant(s) in the stack, while above-headline vowels such as U+0F72, U+0F7A and U+0F7C sometimes form a ligature with the following consonant or punctuation mark.
In Dzongkha (Bhutanese) abbreviated spellings are often the usual way of writing words, and a semi-cursive form of Tibetan script (Joyig) is standard, so the problem occurs frequently. I have a 225-page dictionary, and several other lists, of common abbreviations which are full of examples where this problem occurs. I've attached a couple of real and fairly simple examples.

Example 1
=========

Following normal orthographic rules, the characters to produce Example1_gTuig.jpg would be entered as:

    U+0F42 U+0F4F U+0F74 U+0F72 U+0F42

If the characters remain in that order there is no problem:

    the first U+0F42 is straightforward: the isolated character is displayed as a simple glyph "uni0F42"
    the sequence U+0F4F U+0F74 is replaced by a ligature "uni0F4F0F74"
    U+0F72 U+0F42 is replaced by a ligature "uni0F720F42"

Now if the text goes through a "normalisation" process, the same text ends up reordered as:

    U+0F42 U+0F4F U+0F72 U+0F74 U+0F42

because the combining class value of U+0F72 is less than that of U+0F74. To render this there is no change for the first character, but I now need a lookup to render the whole sequence:

    U+0F4F U+0F72 U+0F74 U+0F42

with two glyphs "uni0F4F0F74 uni0F720F42".

Example 2
=========

Following normal orthographic rules, the characters to produce Example2_gTuop.jpg would be entered as:

    U+0F42 U+0F4F U+0F74 U+0F7C U+0F54

If the characters remain in that order there is no problem:

    the first U+0F42 is as in the first example
    the sequence U+0F4F U+0F74 is replaced by a ligature "uni0F4F0F74"
    U+0F7C U+0F54 is replaced by a ligature "uni0F7C0F54"

However, since the combining class value of U+0F7C is less than that of U+0F74, after a "normalisation" process the same text ends up reordered as:

    U+0F42 U+0F4F U+0F7C U+0F74 U+0F54

and the whole sequence:

    U+0F4F U+0F7C U+0F74 U+0F54

needs to be replaced with the two glyphs "uni0F4F0F74 uni0F7C0F54".
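The reordering described in Examples 1 and 2 can be checked with a short script. This is only a sketch: it assumes Python's standard `unicodedata` module (whose data reflects whatever Unicode version ships with the interpreter), it picks NFD as a representative "normalisation" process, and the helper name `codepoints` is purely illustrative.

```python
import unicodedata

def codepoints(s):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Canonical combining class of the vowel signs used in Examples 1 and 2.
for cp in (0x0F72, 0x0F74, 0x0F7C):
    print(f"U+{cp:04X} ccc={unicodedata.combining(chr(cp))}")

# Sequences as typed, following normal orthographic order.
example1 = "\u0F42\u0F4F\u0F74\u0F72\u0F42"
example2 = "\u0F42\u0F4F\u0F74\u0F7C\u0F54"

# Canonical ordering sorts adjacent marks by combining class, so the
# above-headline vowel moves in front of the subjoined U+0F74.
for typed in (example1, example2):
    normalised = unicodedata.normalize("NFD", typed)
    print(codepoints(typed), "->", codepoints(normalised))
```

In both cases the subjoined vowel U+0F74 should come out after the above-headline vowel, which is the order the extra font lookups then have to cope with.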
Example 3 - (Example3_aMi_aiM.jpg)
==================================

This is taken from an entirely different source, the "TibetBT" font, which was specially created for a project in Sichuan digitising the Tibetan bstan-'gyur (a vast canonical collection of texts in over 200 large volumes, originally translated from Sanskrit into Tibetan). The glyph set of the font is the same as the set of Tibetan stacks found in that collection. All stacks, including any combining vowels, are implemented as precomposed ligatures. This font can be downloaded from (though it is wrapped up in a Windows "setup.exe" file).

Here we have two stacks which one would naturally enter as

    U+0F68 U+0F7E U+0F72

and

    U+0F68 U+0F72 U+0F7E

respectively. No problem so long as the characters remain in that order. However, since U+0F72 has a combining class value greater than that of U+0F7E, in a process of "normalisation" U+0F72 would always float to the end, and both strings would end up as U+0F68 U+0F7E U+0F72 and be indistinguishable.

If there were only a small and fixed number of cases like the first two examples, it would not be *much* of a problem to add the extra lookups, even though my font would need both "many to one" and "many to many" lookups to handle them. But there are *numerous* cases I already know of, and there is no fixed and final list of such abbreviations. So I should really build the tables in my font to be able to handle almost any possibility.

If the combining classes of vowels & marks were based on the expected order, where subjoined vowels are always written before any above-headline vowels, this would be reasonably straightforward to do. But given the order in which they may now wind up after normalisation, it requires adding a huge number of complex lookups to the tables in my font. Once I've done this, it is going to be very difficult to test all the permutations.
Because of the number of additional lookups I need, it is also likely there will be a hefty performance hit, especially when reflowing large documents. Unfortunately the third example can't simply be fixed by font lookups, since two distinct combinations wind up being identical and hence would have to be rendered identically.

If I wrote a piece of software where values I'd assigned caused problems and inefficiencies like this, I'd count it as a major fault or bug and hurry to fix it by assigning the correct values.

I know the Tibetan characters were discussed in great detail by a number of "experts" at the time they were encoded. However, there was little or no substantial discussion amongst these experts about the canonical combining class values assigned to the characters by the UTC. If the combining classes of Tibetan dependent vowels had been based on the order in which these characters are normally written or typed, there would not be this problem in processing them.

I believe that correcting the canonical combining class values of these characters is the best solution. Leaving things as they are is going to cause a lot of extra work for implementors and inefficiencies in implementations. There is no work-around for the problem illustrated by Example 3.

Someone suggested encoding an otherwise identical set of characters with the correct combining class values and deprecating the existing ones, but this is not a real solution, only a kludge. And how could encoding otherwise identical characters in ISO/IEC 10646 be justified, since that standard does not specify canonical combining class values of characters?

- Chris

Christopher Fynn
4 Chester Court
84 Salusbury Road
London NW6 6PA
<<attachment: Example2_gTuop.jpg>>
<<attachment: Example1_gTuig.jpg>>
<<attachment: Example3_aMi_aiM.jpg>>

