I've been having some problems with spurious dotted circles in various versions of HarfBuzz, and I thought I would share before proposing a complete solution to Behdad. I've been looking at 3 versions of HarfBuzz:
1. 'LibreOffice 4.3.4', i.e. whatever (clearly old) version of HarfBuzz is in that version of LibreOffice. I know it's old, because its normalisation orders U+1A60 SAKOT before the tone marks. I have lookups in place to ameliorate that problem.
2. 'HarfBuzz 0.9.38+', i.e. the latest sources at some time today.
3. 'New ISC', i.e. HarfBuzz 0.9.38+ plus changes to Indic Syllable Category (ISC) as I suggested on the Unicode list on 17 May 2014 (post 'Indic Syllable Categories', http://www.unicode.org/mail-arch/unicode-ml/y2014-m05/0038.html). These categories are defined in HarfBuzz by the file hb-ot-shape-complex-indic-table.cc.

I was about to formally submit my suggestions to the Unicode Technical Committee, but then I discovered that the changes would adversely affect HarfBuzz.

The first problem arose with U+1A7B MAI SAM. There is no problem with its use to indicate word (or phrase) repetition by marking the last akshara, nor with its use to indicate the merger of two 1-consonant vowelless consonant stacks. However, a dotted circle occurs in the example /thanon/ <U+1A33 HIGH THA, U+1A60 SAKOT, U+1A36 NA, U+1A7B MAI SAM, U+1A6B SIGN O, U+1A41 RA>. The problem is that MAI SAM has an ISC of 'other', so U+25CC is inserted before SIGN O. Making MAI SAM a 'dependent vowel', as I had suggested, fixed this problem.

The second problem arose with U+1A7A RA HAAM, and could also arise with U+1A7C KARAN. The problem is that with the influx of foreign loans into Thai, in Thailand there are now clusters of two consonants in which the *first* consonant is silent. In most cases, there is no way for Tai Tham to show which is silent, but when the tail of the second consonant rises to the hanging baseline, the placement of the cancellation marks tends to show which consonant is cancelled. A (hypothetical) example is the English surname 'Dawes', which is represented with three consonants in Thai. The transliteration of 'w' is marked as silent.
Conversely, 'Howes' would be written with the transliteration of the 's' as silent. This prevents the font deciding the placement of the cancellation mark on a cluster by cluster basis. Following the lead of Thai, 'Dawes' would be written <U+1A2F DA, U+1A6C SIGN OA BELOW, U+1A45 WA, U+1A7A RA HAAM, U+1A60 SAKOT, U+1A48 HIGH SA>.

LibreOffice 4.3.4 splits the cluster into three syllables, <WA, SAKOT>, <RA HAAM> and <HIGH SA>, and the problem is simply that the subscript form cannot be generated until after the syllable boundaries are dropped. This is simply a variant of the tone and SAKOT problem, which is soluble in the font and has been eliminated for the future.

HarfBuzz 0.9.38+ also splits the cluster into three syllables, <WA>, <RA HAAM> and <U+25CC, SAKOT, HIGH SA>, because RA HAAM has an ISC of 'other'. New ISC marks RA HAAM as a 'pure killer'. Unfortunately, this does not change the misdeduced syllable structure. I think the analysis needs to treat the sequence 'pure killer', 'invisible stacker' as being within a single syllable. Is this too much to ask for?

The third problem arose with U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC DOT, and possibly is not a real problem. I have too few examples of the character's use. CRYPTOGRAMMIC DOT currently has an ISC of 'other', so LibreOffice 4.3.4 and HarfBuzz 0.9.38+ split the sequence <U+1A49 HIGH HA, U+1A7F CRYPTOGRAMMIC DOT, U+1A63 SIGN AA> into three syllables, <HIGH HA>, <CRYPTOGRAMMIC DOT> and <U+25CC, SIGN AA>. It is possible that the input sequence will not occur in the wild. In 'New ISC', CRYPTOGRAMMIC DOT is reclassified as a 'nukta', and the sequence is treated as a single syllable, as desired.

The next problem was with the admittedly unusual writing <U+1A93 THAM DIGIT THREE, U+1A60 SAKOT, U+1A34 LOW TA> 'three times'. None of the three versions allowed the digit to be treated as a consonant base, and so U+25CC was introduced before SAKOT.
Does the SEA engine need to be specifically instructed to treat Tai Tham decimal numbers as potential character bases?

Some of my changes for 'New ISC' had bad consequences. Changing U+1A53 TAI THAM LETTER LAE from a letter to an independent vowel resulted in <U+1A29 LOW CA, U+1A60 SAKOT, U+1A53 LAE> being split into two syllables, <LOW CA, SAKOT> and <LAE>. While the font can work round this, this is not good.

Changing U+1A74 TAI THAM SIGN MAI KANG from 'dependent vowel' to 'bindu' resulted in the word <U+1A37 BA, U+1A74 MAI KANG, U+1A75 TONE-1> being split into two syllables, <BA, MAI KANG> and <U+25CC, TONE-1>. This seems odd; U+0ECD LAO NIGGAHITA is classified by Unicode as 'bindu', yet regularly has tone marks mounted on it. Is the syllable splitting here a HarfBuzz error?

Richard.
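
P.S. The way an ISC of 'other' leads to a dotted circle in the first and third problems can be sketched with a toy Python model. This is only an illustration under loud assumptions: it is not HarfBuzz's syllable machine, and the function toy_clusters, the one-letter categories and both tables are hypothetical names invented here. It assumes that a character of category 'other' always ends the current cluster, and that a dependent mark with no base in front of it is given U+25CC as a base. Only the placement of U+25CC is meant to mirror the behaviour reported above; how the rest of the text is divided is a by-product of the toy rules.

```python
# Toy model only: NOT HarfBuzz's real syllable machine.
DOTTED_CIRCLE = "\u25CC"

# Hypothetical, simplified categories for the characters in two examples:
# "C" = consonant, "H" = invisible stacker, "M" = dependent mark, "X" = other.
CURRENT_ISC = {
    "\u1A33": "C",  # HIGH THA
    "\u1A36": "C",  # NA
    "\u1A41": "C",  # RA
    "\u1A49": "C",  # HIGH HA
    "\u1A60": "H",  # SAKOT
    "\u1A63": "M",  # SIGN AA
    "\u1A6B": "M",  # SIGN O
    "\u1A7B": "X",  # MAI SAM, currently 'other'
    "\u1A7F": "X",  # CRYPTOGRAMMIC DOT, currently 'other'
}

# 'New ISC': MAI SAM becomes a dependent vowel and CRYPTOGRAMMIC DOT a nukta;
# the toy model treats both simply as dependent marks.
NEW_ISC = {**CURRENT_ISC, "\u1A7B": "M", "\u1A7F": "M"}


def toy_clusters(text, isc):
    """Split text into toy clusters, adding U+25CC before baseless marks."""
    clusters = []
    open_cluster = False  # True while the last cluster can still take marks
    for ch in text:
        cat = isc[ch]
        if cat == "X":
            # 'other' stands alone and closes whatever came before it.
            clusters.append([ch])
            open_cluster = False
        elif cat == "C":
            # A consonant joins the cluster only directly after a stacker.
            if open_cluster and isc.get(clusters[-1][-1]) == "H":
                clusters[-1].append(ch)
            else:
                clusters.append([ch])
                open_cluster = True
        else:
            # Stacker or dependent mark: needs a base, else gets U+25CC.
            if not open_cluster:
                clusters.append([DOTTED_CIRCLE])
                open_cluster = True
            clusters[-1].append(ch)
    return ["".join(c) for c in clusters]


THANON = "\u1A33\u1A60\u1A36\u1A7B\u1A6B\u1A41"  # first problem
DOTTED = "\u1A49\u1A7F\u1A63"                    # third problem

for name, text in (("thanon", THANON), ("cryptogrammic dot", DOTTED)):
    for label, table in (("current", CURRENT_ISC), ("new ISC", NEW_ISC)):
        parts = toy_clusters(text, table)
        shown = [" ".join(f"U+{ord(c):04X}" for c in p) for p in parts]
        print(f"{name:18} {label:8} {shown}")
```

With the current categories the toy model places U+25CC before SIGN O and before SIGN AA; with the 'New ISC' categories neither example acquires one.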
