Re: Counting Devanagari Aksharas
> Date: Wed, 26 Apr 2017 07:45:07 +0100
> From: Richard Wordingham via Unicode
>
> On Wed, 26 Apr 2017 08:48:13 +0300
> Eli Zaretskii via Unicode wrote:
>
> > > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > > From: Richard Wordingham
> > > Cc: Eli Zaretskii
> > >
> > > If I search for CGJ, highlighting it is frequently supremely
> > > useless. I want to know where it is; highlighting is merely a tool
> > > to find it on the screen.
> >
> > So I guess this means highlighting is useful after all ;-)
>
> ᩺Not if the area highlit is zero pixels wide.

If you elide too much of the context, the discussion could lose all of its meaning. Let me restore some of the relevant context:

> > > > > On 2017-04-22, Eli Zaretskii via Unicode wrote:
> > > >
> > > > > > I could imagine Emacs decomposing characters temporarily when only
> > > > > > part of a cluster matches the search string. Assuming this would
> > > > > > make sense to users of some complex scripts, that is. You are
> > > > > > welcome to suggest such a feature by using report-emacs-bug.
> > > >
> > > > The cursor moves to the cluster boundary, so there is much less of a
> > > > problem with Emacs.
> > >
> > > But you wanted to highlight only part of the cluster, AFAIU.
> >
> > If I search for CGJ, highlighting it is frequently supremely useless.
> > I want to know where it is; highlighting is merely a tool to find it on
> > the screen.
>
> So I guess this means highlighting is useful after all ;-)

IOW, the context was a suggestion to temporarily disable character composition, in which case CGJ _will_ be displayed as a non-zero-width glyph, at least in the default Emacs display configuration, and CGJ _will_ be visible with its highlight.
Re: Counting Devanagari Aksharas
On Wed, 26 Apr 2017 08:48:13 +0300 Eli Zaretskii via Unicode wrote:

> > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > From: Richard Wordingham
> > Cc: Eli Zaretskii
> >
> > If I search for CGJ, highlighting it is frequently supremely
> > useless. I want to know where it is; highlighting is merely a tool
> > to find it on the screen.
>
> So I guess this means highlighting is useful after all ;-)

᩺Not if the area highlit is zero pixels wide.

Richard.
Re: Counting Devanagari Aksharas
> Date: Sun, 23 Apr 2017 22:59:49 +0100
> From: Richard Wordingham
> Cc: Eli Zaretskii
>
> If I search for CGJ, highlighting it is frequently supremely useless.
> I want to know where it is; highlighting is merely a tool to find it on
> the screen.

So I guess this means highlighting is useful after all ;-)
Re: Go romanize! Re: Counting Devanagari Aksharas
Quote from below: The word indeed means 'danger' (Pali/Sanskrit _antarāya_). The pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai Tham script no longer have /r/. The older sequence /tr/ normally became /tʰ/ (except in Lao), but the spelling has not been updated - at least, not amongst the more literate. The script has a special symbol for the short vowel /o/, which it shares with the Lao script. This symbol is used in writing that word. Two ways I have seen it spelt, each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy. I have also seen a form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy. However, I have seen nothing that shows that I won't encounter ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives? Response: Perhaps this word is derived from Sanskrit 'anþaraða' (Search: antarada at http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche) Sinhala:anþaraaðaayakayi, anþaraava, anþaraavayi, anþraava, anþraavayi Use this font to read the above Sinhala words: http://smartfonts.net/ttf/aruna.ttf -=- svasþi siððham! -=- On 4/25/2017 2:07 AM, Richard Wordingham via Unicode wrote: On Mon, 24 Apr 2017 20:53:12 +0530 Naena Guru via Unicodewrote: Quote by Richard: Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks ofhttp://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy". MY RESPONSE: Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape. (letter, ruupa). 
Reduce the writing system to its consonants, vowels etc. (zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best technical facilities for any language. (This is why now more than 130 languages romanize despite Unicode). Use English letters for similar sounds in the native speech. Now, treat all combinations as ligatures. For example, 'po' sound in Indic has the p consonant with a sign ahead plus a sign after. In many Indic scripts, yes. In Devanagari, the vowel sign is normally a singly element classified as following the consonant. In Thai, the vowel sign precedes the consonant. Tai Tham uses both a two-part sign and a preceding sign. The preceding sign is for Tai words and the two-part sign for Pali words, but loanwords from Pali into the Tai languages may retain the two part sign. For the font, there is no difference between the way it makes the combination 'ä', which has a sign above and the Indic having two on either side. For OpenType, there is. The first can be made by providing a simple table of where the diaeresis goes relative to the base characters, in this case the diaeresis. The second is painfully complicated, for the 'p' may have other marks attached to it, so doing it be relative positioning is painfully complicated and error-prone. This job is given to the rendering engine, which may introduce its own problems. AAT and Graphite offer the font maker the ability to move the 'sign ahead' from after the 'p' to before it. Recall that long ago, Unicode stopped defining fixed ligatures and asked the font makers to define them in the PUA. While the first is true enough, I believe the second is false. Not every glyph has to be mapped to by a single character. I don't do that for contextual forms or ligatures in my font. Spelling and speech: There is indeed a confusion about writing and reading in Hindi, as I have observed. Like in English and Tamil, Hindi tends to end words with a consonant. 
So, there is this habit among the Hindi speakers to drop the ending vowel, mostly 'a' from words that actually end with it. For example, the famous name Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like Sinhhala have vowel ending and are traditionally spoken as such. This loss is also to be found in Further India. Thai, Lao and Khmer now require that such a word-final vowel be written explicitly if it is still pronounced. Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. If I am to make a guess, its ending is like in ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I am right, this is a good example of decline if a writing system owing to bad, uncaring application of technology. We are in the Digital Age, and we need not compromise any more. In fact, we can fix errors and decadence introduced by past technologies. The word
Re: Go romanize! Re: Counting Devanagari Aksharas
On Mon, 24 Apr 2017 20:53:12 +0530 Naena Guru via Unicode wrote:

> Quote by Richard:
> Unless this implies a spelling reform for many languages, I'd like to
> see how this works for the Tai Tham script. I'm not happy with the
> Romanisation I use to work round hostile rendering engines. (My
> scheme is only documented in variable hack_ss02 in the last script
> blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For
> example, there are several different ways of writing what one might
> naively record as "ontarAy".
>
> MY RESPONSE:
> Richard, I stuck to the two specifications (Unicode and Font) and
> Sanskrit grammar. The akSara has two aspects, its sound (zabða,
> phoneme) and its shape. (letter, ruupa). Reduce the writing system to
> its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
> them (ruupa). SBCS provides the best technical facilities for any
> language. (This is why now more than 130 languages romanize despite
> Unicode). Use English letters for similar sounds in the native
> speech. Now, treat all combinations as ligatures. For example, 'po'
> sound in Indic has the p consonant with a sign ahead plus a sign
> after.

In many Indic scripts, yes. In Devanagari, the vowel sign is normally a single element classified as following the consonant. In Thai, the vowel sign precedes the consonant. Tai Tham uses both a two-part sign and a preceding sign. The preceding sign is for Tai words and the two-part sign for Pali words, but loanwords from Pali into the Tai languages may retain the two-part sign.

> For the font, there is no difference between the way it makes
> the combination 'ä', which has a sign above and the Indic having two
> on either side.

For OpenType, there is. The first can be made by providing a simple table of where the diaeresis goes relative to the base characters, in this case the 'a'. The second is harder, for the 'p' may have other marks attached to it, so doing it by relative positioning is painfully complicated and error-prone. This job is given to the rendering engine, which may introduce its own problems. AAT and Graphite offer the font maker the ability to move the 'sign ahead' from after the 'p' to before it.

> Recall that long ago, Unicode stopped defining fixed
> ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false. Not every glyph has to be mapped to by a single character. I don't do that for contextual forms or ligatures in my font.

> Spelling and speech:
> There is indeed a confusion about writing and reading in Hindi, as I
> have observed. Like in English and Tamil, Hindi tends to end words
> with a consonant. So, there is this habit among the Hindi speakers to
> drop the ending vowel, mostly 'a' from words that actually end with
> it. For example, the famous name Jayantha (miserable mine too, haha!
> = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It
> is a Sanskrit word. Sanskrit and languages like Sinhhala have vowel
> ending and are traditionally spoken as such.

This loss is also to be found in Further India. Thai, Lao and Khmer now require that such a word-final vowel be written explicitly if it is still pronounced.

> Looking at the word you gave, ontarAy, it looks to me like an
> Anglicized form. If I am to make a guess, its ending is like in
> ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I
> am right, this is a good example of decline if a writing system owing
> to bad, uncaring application of technology. We are in the Digital
> Age, and we need not compromise any more. In fact, we can fix errors
> and decadence introduced by past technologies.

The word indeed means 'danger' (Pali/Sanskrit _antarāya_). The pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai Tham script no longer have /r/. The older sequence /tr/ normally became /tʰ/ (except in Lao), but the spelling has not been updated - at least, not amongst the more literate.

The script has a special symbol for the short vowel /o/, which it shares with the Lao script. This symbol is used in writing that word. Two ways I have seen it spelt, each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy. I have also seen a form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy. However, I have seen nothing that shows that I won't encounter ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy.

How does your scheme distinguish such alternatives?

Richard.
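The two attested spellings of 'danger' look similar on screen, so one way to see how they differ is to list their code points. The following is only a minimal Python sketch, assuming the strings survive copying from the message intact; the dictionary keys reuse the informal romanisations above, and the helper name `codepoints` is made up for illustration.

```python
import unicodedata

# Spellings copied from the message above; the keys are its informal romanisations.
spellings = {
    "on-trAy": "ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ",
    "o-ntrAy": "ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ",
}

def codepoints(text):
    """Return (U+XXXX, character name) pairs, one per code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

for label, text in spellings.items():
    print(label)
    for cp, name in codepoints(text):
        print("   ", cp, name)
```

Any romanisation that is meant to round-trip would have to preserve the differences this listing exposes, which is the point of the question above.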
Go romanize! Re: Counting Devanagari Aksharas
Quote by Richard: Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy". MY RESPONSE: Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape. (letter, ruupa). Reduce the writing system to its consonants, vowels etc. (zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best technical facilities for any language. (This is why now more than 130 languages romanize despite Unicode). Use English letters for similar sounds in the native speech. Now, treat all combinations as ligatures. For example, 'po' sound in Indic has the p consonant with a sign ahead plus a sign after. For the font, there is no difference between the way it makes the combination 'ä', which has a sign above and the Indic having two on either side. Recall that long ago, Unicode stopped defining fixed ligatures and asked the font makers to define them in the PUA. Spelling and speech: There is indeed a confusion about writing and reading in Hindi, as I have observed. Like in English and Tamil, Hindi tends to end words with a consonant. So, there is this habit among the Hindi speakers to drop the ending vowel, mostly 'a' from words that actually end with it. For example, the famous name Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like Sinhhala have vowel ending and are traditionally spoken as such. Dictionary is a commercial invention. 
When Caxton brought lead types to England, French-speaking Latin-flaunting elites did not care about the poor natives. Earlier, invading Romans forced them to drop Fuþark and adopt the 22-letter Latin alphabet. So, they improvised. Struck a line across d and made ð, Eth; added a sign to 'a' and made æ (Asc) and continued using Thorn (þ) by rounding the loop. Lead type printing hit English for the second time, ruining it as the spell standardizing began. Dictionaries sold. THE POWERFUL CAN RUIN PEOPLE'S PROPERTY BECAUSE THEY CAN IN ORDER TO MAKE MONEY. Unicode enthusiasts, take heed! Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. If I am to make a guess, its ending is like in ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I am right, this is a good example of decline if a writing system owing to bad, uncaring application of technology. We are in the Digital Age, and we need not compromise any more. In fact, we can fix errors and decadence introduced by past technologies. RICHARD: That sounds like a letter-assembly system. MY RESPONSE: Nothing assembled there, my friend. On 4/24/2017 12:38 PM, Richard Wordingham via Unicode wrote: On Mon, 24 Apr 2017 00:36:26 +0530 Naena Guru via Unicodewrote: The Unicode approach to Sanskrit and all Indic is flawed. Indic should not be letter-assembly systems. Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of the speech. Each writing system then assigns a shape to the phonetically precise phoneme. The most technically and grammatically proper solution for Indic is first to ROMANIZE the group of writing systems at the level of phonemes. That is, assign romanized shapes to vowels, consonants, prenasals, post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. This approach is similar to how European languages picked up Latin, improvised the script and even uses Simples and Capitals repertoire. 
Romanizing immediately makes typing easier and eliminates sometimes embarrassing ambiguity in Anglicizing -- you type phonetically on key layouts close to QWERTY. (Only four positions are different in Romanized Sinhala layout). If we drop the capitalizing rules and utilize caps to indicate the 'other' forms of a common letter, we get an intuitively typed system for each language, and readable too. When this is done carefully, comparing phoneme sets of the languages, we can reach a common set of Latin-derived SINGLE-BYTE letters completely covering all phonemes of all Indic. Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy". Next, each native script can be obtained by making
Re: Counting Devanagari Aksharas
On Mon, 24 Apr 2017 00:36:26 +0530 Naena Guru via Unicode wrote:

> The Unicode approach to Sanskrit and all Indic is flawed. Indic
> should not be letter-assembly systems.
>
> Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
> the speech. Each writing system then assigns a shape to the
> phonetically precise phoneme.
>
> The most technically and grammatically proper solution for Indic is
> first to ROMANIZE the group of writing systems at the level of
> phonemes. That is, assign romanized shapes to vowels, consonants,
> prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
> allophones) etc. This approach is similar to how European languages
> picked up Latin, improvised the script and even uses Simples and
> Capitals repertoire. Romanizing immediately makes typing easier and
> eliminates sometimes embarrassing ambiguity in Anglicizing -- you
> type phonetically on key layouts close to QWERTY. (Only four
> positions are different in Romanized Sinhala layout).
>
> If we drop the capitalizing rules and utilize caps to indicate the
> 'other' forms of a common letter, we get an intuitively typed system
> for each language, and readable too. When this is done carefully,
> comparing phoneme sets of the languages, we can reach a common set of
> Latin-derived SINGLE-BYTE letters completely covering all phonemes of
> all Indic.

Unless this implies a spelling reform for many languages, I'd like to see how this works for the Tai Tham script. I'm not happy with the Romanisation I use to work round hostile rendering engines. (My scheme is only documented in variable hack_ss02 in the last script blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.) For example, there are several different ways of writing what one might naively record as "ontarAy".

> Next, each native script can be obtained by making orthographic smart
> fonts that display the SBCS codes in the respective shapes of the
> native scripts.

That sounds like a letter-assembly system. So how does your scheme help one split words into orthographic syllables?

> I have successfully romanized Sinhala and revived the full repertoire
> of Sinhla + Sanskrit orthography losing nothing. Sinhala script is
> perhaps the most complex of all Indic because it is used to write
> both Sanskrit and Pali.

What complication does Pali impose on top of Sanskrit? As far as I'm aware, it just needs one extra letter, usually called LLA, which you will already have if 'Sanskrit' includes Vedic Sanskrit.

> See this: http://ahangama.com/ (It's all SBCS underneath).
> Test here: http://ahangama.com/edit.htm

All I get for these are blank pages. Perhaps there's an unreported communication failure in the network.

Richard.
Re: Counting Devanagari Aksharas
On Sun, 23 Apr 2017 05:40:29 +0300 Eli Zaretskii via Unicode wrote:

> > The cursor moves to the cluster boundary, so there is much less of a
> > problem with Emacs.
>
> But you wanted to highlight only part of the cluster, AFAIU.

If I search for CGJ, highlighting it is frequently supremely useless. I want to know where it is; highlighting is merely a tool to find it on the screen.

Richard.
Re: Counting Devanagari Aksharas
The Unicode approach to Sanskrit and all Indic is flawed. Indic should not be letter-assembly systems. Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of the speech. Each writing system then assigns a shape to the phonetically precise phoneme. The most technically and grammatically proper solution for Indic is first to ROMANIZE the group of writing systems at the level of phonemes. That is, assign romanized shapes to vowels, consonants, prenasals, post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. This approach is similar to how European languages picked up Latin, improvised the script and even uses Simples and Capitals repertoire. Romanizing immediately makes typing easier and eliminates sometimes embarrassing ambiguity in Anglicizing -- you type phonetically on key layouts close to QWERTY. (Only four positions are different in Romanized Sinhala layout). If we drop the capitalizing rules and utilize caps to indicate the 'other' forms of a common letter, we get an intuitively typed system for each language, and readable too. When this is done carefully, comparing phoneme sets of the languages, we can reach a common set of Latin-derived SINGLE-BYTE letters completely covering all phonemes of all Indic. Next, each native script can be obtained by making orthographic smart fonts that display the SBCS codes in the respective shapes of the native scripts. I have successfully romanized Sinhala and revived the full repertoire of Sinhla + Sanskrit orthography losing nothing. Sinhala script is perhaps the most complex of all Indic because it is used to write both Sanskrit and Pali. See this: http://ahangama.com/ (It's all SBCS underneath). Test here: http://ahangama.com/edit.htm On 4/20/2017 5:05 AM, Richard Wordingham via Unicode wrote: Is there consensus on how to count aksharas in the Devanagari script? The doubts I have relate to a visible halant in orthographic syllables other than the first. 
For example, according to 'Devanagari VIP Team Issues Report' http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a derived form from Nepali श्रीमान् should be written श्रीमान्को and not श्रीमान्को . Now, if the font used has a conjunct for SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO and the latter as having 3 aksharas SH.RII, MAA, N.KO. If the font leads to the use of a visible halant instead of the vattu conjunct SH.RA, as happens when I view this email, would there then be 5 and 4 aksharas respectively? A further complication is that the font chosen treats what looks like SH, RA as a conjunct; the vowel I appears to the left of SH when added after RA (श्रि). Richard.
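The font-independent part of the count in the question can be sketched in code. This is a rough Python illustration, not an agreed definition: it assumes an akshara starts at an independent vowel or at a consonant that does not directly follow a virama, and that a ZWNJ after the virama blocks the join. The function names and character ranges are choices made for this sketch, and it deliberately ignores the font-dependent cases (a visible halant where the font lacks a conjunct) that the question is really about.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def is_consonant(ch):
    # KA..HA plus the nukta consonants QA..YYA (assumed sufficient for this sketch).
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"

def is_independent_vowel(ch):
    return "\u0904" <= ch <= "\u0914"

def split_aksharas(text):
    """Split Devanagari text into aksharas using code points only."""
    aksharas, current = [], ""
    for ch in text:
        joins = is_consonant(ch) and current.endswith(VIRAMA)
        starts_new = (is_consonant(ch) or is_independent_vowel(ch)) and not joins
        if starts_new and current:
            aksharas.append(current)
            current = ""
        current += ch
    if current:
        aksharas.append(current)
    return aksharas

# SHA VIRAMA RA II, MA AA, NA VIRAMA [ZWNJ], KA O
with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = with_zwnj.replace(ZWNJ, "")

print(len(split_aksharas(with_zwnj)))     # 4: SH.RII, MAA, N, KO
print(len(split_aksharas(without_zwnj)))  # 3: SH.RII, MAA, N.KO
```

Under these assumptions the sketch reproduces the 4 and 3 counts given for a font with a SH.RA conjunct; the 5 and 4 counts would need information from the font, which code points alone cannot supply.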
Re: Counting Devanagari Aksharas
On 4/22/2017 9:25 PM, Manish Goregaokar via Unicode wrote:

> Backspace in browsers (chrome and firefox) deletes within EGCs too. They delete matras in devanagari, and jamos in hangul. They don't *exactly* work off of code points (e.g. flag emoji gets deleted as a whole in many backspace implementations)

Flag emoji and many other "invisible" sequences are different from ligatures and conjuncts in one important way: their elements are not usually key strokes, but the full sequence would be inserted from a pick list or other type of input method.

If you didn't "type" each of the elements of the sequence, then deleting individual ones is something you would only need for debugging or other specialized purposes, not for undoing a physical action (keystroke) in reverse order.

Speaking of undoing: not all editors always support full key-stroke by key-stroke undo, some will coalesce longer runs of text. This saves on space for the undo buffer, but also makes undoing more extensive edits less painful.

It's clearly a personal preference whether such "streamlining" would feel "right" or "bothersome". Beyond the last line typed, or two, I may really not care if undo went word by word, say.

A./
Re: Counting Devanagari Aksharas
> You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor.

Yes, and this is one of the reasons it feels broken in devanagari, you get cursors in the midst of aksharas, in weird places.

Backspace in browsers (chrome and firefox) deletes within EGCs too. They delete matras in devanagari, and jamos in hangul. They don't *exactly* work off of code points (e.g. flag emoji gets deleted as a whole in many backspace implementations)

-Manish

On Sat, Apr 22, 2017 at 12:22 PM, Eli Zaretskii via Unicode wrote:
>> Date: Sat, 22 Apr 2017 17:13:36 +0100
>> From: Richard Wordingham via Unicode
>>
>> > Movement by grapheme
>> > cluster is AFAIK the most natural way of moving in complex scripts.
>>
>> Evidence?
>
> Personal experience?
>
>> It's easiest for displaying the cursor.
>
> It's the _only_ way of displaying the cursor. You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor. And
> without being able to position the cursor, a visual feedback to the
> user becomes troublesome at best.
>
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string. Assuming this would make
> sense to users of some complex scripts, that is. You are welcome to
> suggest such a feature by using report-emacs-bug.
>
>> SIL's Graphite supports the idea of a split cursor, which
>> shows the glyphs corresponding to the characters before and after the
>> cursor position.
>
> I find split-cursor to be a nuisance, FWIW. IME, it confuses the
> users without making anything much clearer.
Re: Counting Devanagari Aksharas
> Date: Sun, 23 Apr 2017 00:51:59 +0100
> Cc: Julian Bradfield
> From: Richard Wordingham via Unicode
>
> On Sat, 22 Apr 2017 21:39:42 +0100 (BST)
> Julian Bradfield via Unicode wrote:
>
> > On 2017-04-22, Eli Zaretskii via Unicode wrote:
>
> > > I could imagine Emacs decomposing characters temporarily when only
> > > part of a cluster matches the search string. Assuming this would
> > > make sense to users of some complex scripts, that is. You are
> > > welcome to suggest such a feature by using report-emacs-bug.
>
> The cursor moves to the cluster boundary, so there is much less of a
> problem with Emacs.

But you wanted to highlight only part of the cluster, AFAIU.

> > That's what I do in my emacs with combining characters, and if I had
> > complex script support, I'd expect the same to happen there.
> > emacs is a programmer's editor, after all :)
>
> Emacs probably has a way of toggling complex script support somewhere.
> I'm torn between seeing the text properly set out and seeing exactly
> what it is that I've typed. 'Reveal codes' doesn't seem widely
> supported.

"M-x auto-composition-mode RET" should do what you want.
Re: Counting Devanagari Aksharas
On Sat, 22 Apr 2017 21:39:42 +0100 (BST) Julian Bradfield via Unicode wrote:

> On 2017-04-22, Eli Zaretskii via Unicode wrote:

> > I could imagine Emacs decomposing characters temporarily when only
> > part of a cluster matches the search string. Assuming this would
> > make sense to users of some complex scripts, that is. You are
> > welcome to suggest such a feature by using report-emacs-bug.

The cursor moves to the cluster boundary, so there is much less of a problem with Emacs.

> That's what I do in my emacs with combining characters, and if I had
> complex script support, I'd expect the same to happen there.
> emacs is a programmer's editor, after all :)

Emacs probably has a way of toggling complex script support somewhere. I'm torn between seeing the text properly set out and seeing exactly what it is that I've typed. 'Reveal codes' doesn't seem widely supported.

Richard.
Re: Counting Devanagari Aksharas
On 2017-04-22, Eli Zaretskii via Unicode wrote:
>> From: Richard Wordingham via Unicode
[...]
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string. Assuming this would make
> sense to users of some complex scripts, that is. You are welcome to
> suggest such a feature by using report-emacs-bug.

That's what I do in my emacs with combining characters, and if I had complex script support, I'd expect the same to happen there. emacs is a programmer's editor, after all :)

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Counting Devanagari Aksharas
> Date: Sat, 22 Apr 2017 17:13:36 +0100
> From: Richard Wordingham via Unicode
>
> > Movement by grapheme
> > cluster is AFAIK the most natural way of moving in complex scripts.
>
> Evidence?

Personal experience?

> It's easiest for displaying the cursor.

It's the _only_ way of displaying the cursor. You cannot even meaningfully move by single characters in most clusters, because composing characters generally completely changes how the original characters looked, so there's nowhere you can display the cursor. And without being able to position the cursor, a visual feedback to the user becomes troublesome at best.

> I've encountered the problem that, while at least I can search for
> text smaller than a cluster, there's no indication in the window of
> where in the window the text is.

I could imagine Emacs decomposing characters temporarily when only part of a cluster matches the search string. Assuming this would make sense to users of some complex scripts, that is. You are welcome to suggest such a feature by using report-emacs-bug.

> SIL's Graphite supports the idea of a split cursor, which
> shows the glyphs corresponding to the characters before and after the
> cursor position.

I find split-cursor to be a nuisance, FWIW. IME, it confuses the users without making anything much clearer.
Re: Counting Devanagari Aksharas
On Sat, 22 Apr 2017 13:34:32 +0300 Eli Zaretskii via Unicode wrote:

> AFAIR, Emacs allows one to _delete_ individual characters,
> i.e. Backspace and C-d delete character-by-character, so the problem
> shouldn't be so grave for imperfect typists.

Deleting forwards by one _character_ certainly makes life less harsh. It's pleasanter than the UAX#29 suggestion, "For example, on a given system the backspace key might delete by code point, while the delete key may delete an entire cluster".

> Movement by grapheme
> cluster is AFAIK the most natural way of moving in complex scripts.

Evidence? It's easiest for displaying the cursor.

I've encountered the problem that, while at least I can search for text smaller than a cluster, there's no indication in the window of where in the window the text is.

SIL's Graphite supports the idea of a split cursor, which shows the glyphs corresponding to the characters before and after the cursor position.

Richard.
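To make the quoted UAX#29 behaviour concrete, here is a small Python sketch of a backspace that removes one code point and a forward delete that removes a whole cluster. It is only an approximation under a stated assumption: a "cluster" here is one code point plus any following characters whose general category starts with "M" (marks), which is far simpler than the real UAX#29 rules. The function names are invented for this example.

```python
import unicodedata

def cluster_end(text, start):
    """End index of a simplified cluster: one code point plus following marks."""
    end = start + 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return end

def backspace(text, pos):
    """Delete the single code point before pos; return (new_text, new_pos)."""
    if pos == 0:
        return text, pos
    return text[:pos - 1] + text[pos:], pos - 1

def delete_forward(text, pos):
    """Delete the whole simplified cluster starting at pos; return (new_text, pos)."""
    if pos >= len(text):
        return text, pos
    return text[:pos] + text[cluster_end(text, pos):], pos

word = "\u0915\u093F\u0924"  # DEVANAGARI KA + VOWEL SIGN I + TA

print(backspace(word, 2))       # removes only the vowel sign, leaving KA + TA
print(delete_forward(word, 0))  # removes KA together with its vowel sign
```

With this split, a typist who mistypes a vowel sign can correct it with backspace alone, while forward delete still clears a whole cluster at once.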
Re: Counting Devanagari Aksharas
> Date: Sat, 22 Apr 2017 11:13:16 +0100
> From: Richard Wordingham via Unicode
>
> At present these are split into two and three grapheme clusters
> respectively, and LibreOffice cursor movement responds accordingly.
> (SIGN AA starts a grapheme cluster in several scripts of further
> India.) However, if one teaches the Emacs editor what a Tai Tham
> syllable is, so that it can use the M17n rendering library, the cursor
> then advances syllable by syllable, which is unpleasant for imperfect
> typists.

AFAIR, Emacs allows one to _delete_ individual characters, i.e. Backspace and C-d delete character-by-character, so the problem shouldn't be so grave for imperfect typists. Movement by grapheme cluster is AFAIK the most natural way of moving in complex scripts.
Re: Counting Devanagari Aksharas
On Fri, 21 Apr 2017 16:27:43 -0700 Manish Goregaokar via Unicodewrote: > > Do Hindi speakers really think of orthographic syllables as > > characters? > > When rendered as a cluster, yes? I've asked around, and folks seem to > insist on coupling it to the rendering. That argues that it's a unit, which I don't think is in dispute. Words are also units, and nowadays we don't normally insist that one retype a word just to change one bit of it. > Given most fonts render > *normal* (common, etc) clusters, I think making them EGCs and looking > at nonrendered clusters the same way we do family emoji is fine > (family emojis of length 5 are a single EGC, but that's not what's > actually perceived by the user, but it's a use case that's very rare > in the wild, so it doesn't matter). That depends on the language. In the Tai Tham script, even without consonant clusters one can get 5 graphic characters in a syllable, e.g. ᨧᩮᩢ᩶ᩣ _cao_ 'lord; you (polite)', and when one adds consonant clusters one easily gets monosyllables like ᨠᩖ᩠᩶ᩅ᩠ᨿ _kluai_ 'banana' with 5 graphic characters and additionally 2 coengs. (One can distinguish Pali from the Tai languages simply by the density of the ink!) At present these are split into two and three grapheme clusters respectively, and LibreOffice cursor movement responds accordingly. (SIGN AA starts a grapheme cluster in several scripts of further India.) However, if one teaches the Emacs editor what a Tai Tham syllable is, so that it can use the M17n rendering library, the cursor then advances syllable by syllable, which is unpleasant for imperfect typists. Fortunately, it's possible to add functions to Emacs to allow it to advance character-by-character; I forget if one has to also add a few code changes. (The downside is that text either side of the cursor is rendered independently, which can be a nuisance when editing very long lines.) 
> The way I see it, the current > system is wrong, and so would the proposed system of not breaking at > viramas (or not breaking at viramas followed by a consonant if we want > to be more precise), but the proposed system would be wrong much less > often. > I am only talking about Devanagari, though scripts like > Bangla/Gujrati/Gurmukhi may have similar needs. Breaking on ZWNJ seems > sensible. Indeed, viramas (InSC=Virama) will have to be handled case-by-case. One should continue to break after pulli (U+0BCD TAMIL SIGN VIRAMA) except for the cases of the ligatures/conjuncts. I don't know if there are obscure cases, or whether it's only _shri_ and for which one should not break just because of the virama. Continuation after coengs (InSC=Invisible_Stacker) should be automatic. Malayalam will need customisation. Definitions by codepoints are only a fallback, for when a font cannot be used to guide the process. Formally, normalisation is a problem, as these characters can be separated from letters by other marks. This is a problem in practice for normalised text in Tai Tham. Pure killers (InSC=Pure_Killer) should probably be given no special treatment, as at present, by default, though I wonder if we should define orthographic syllables for Pali in Thai script. The two orthographies will need different rules, and renderers won't help. Defining orthographic syllables for languages in the Latin script is probably excessive. Richard.
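The case-by-case treatment of viramas, coengs and pure killers described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `INSC` table is a tiny hand-copied excerpt (values as I recall them from the UCD file IndicSyllabicCategory.txt, so check them against the version you use), and `joins_next`, `break_after_virama` and `conjuncts` are hypothetical names, not part of any standard or library. The Tamil exception set contains only an illustrative _shri_ entry.

```python
# Hand-copied excerpt; real code should parse IndicSyllabicCategory.txt.
INSC = {
    0x094D: "Virama",             # DEVANAGARI SIGN VIRAMA
    0x0BCD: "Virama",             # TAMIL SIGN VIRAMA (pulli)
    0x17D2: "Invisible_Stacker",  # KHMER SIGN COENG
    0x1A60: "Invisible_Stacker",  # TAI THAM SIGN SAKOT
    0x0E3A: "Pure_Killer",        # THAI CHARACTER PHINTHU
}

def joins_next(prev_cons, mark, next_cons,
               break_after_virama=False, conjuncts=frozenset()):
    """Should next_cons stay in the same orthographic syllable as prev_cons?

    break_after_virama: locale/script choice (e.g. True for Tamil).
    conjuncts: consonant + virama + consonant strings that stay joined
               even when break_after_virama is True.
    """
    category = INSC.get(ord(mark))
    if category == "Invisible_Stacker":
        return True                       # continuation is automatic
    if category == "Virama":
        if not break_after_virama:
            return True                   # e.g. the proposed Devanagari rule
        return prev_cons + mark + next_cons in conjuncts
    return False                          # Pure_Killer or unlisted: no special treatment

# Illustrative Tamil exception: SHA + pulli + RA, the start of the shri ligature.
TAMIL_CONJUNCTS = frozenset({"\u0BB6\u0BCD\u0BB0"})

devanagari = joins_next("\u0928", "\u094D", "\u0915")
tamil_plain = joins_next("\u0B95", "\u0BCD", "\u0B95", True, TAMIL_CONJUNCTS)
tamil_shri = joins_next("\u0BB6", "\u0BCD", "\u0BB0", True, TAMIL_CONJUNCTS)
tai_tham = joins_next("\u1A20", "\u1A60", "\u1A45")
thai = joins_next("\u0E19", "\u0E3A", "\u0E01")
```

A code-point table like this is only the fallback mentioned in the message; it knows nothing about what a particular font actually renders, and it ignores the normalisation problem of marks intervening between the stacker and the letters.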
Re: Counting Devanagari Aksharas
> Do Hindi speakers really think of orthographic syllables as characters? When rendered as a cluster, yes? I've asked around, and folks seem to insist on coupling it to the rendering. Given most fonts render *normal* (common, etc) clusters, I think making them EGCs and looking at nonrendered clusters the same way we do family emoji is fine (family emojis of length 5 are a single EGC, but that's not what's actually perceived by the user, but it's a use case that's very rare in the wild, so it doesn't matter). The way I see it, the current system is wrong, and so would the proposed system of not breaking at viramas (or not breaking at viramas followed by a consonant if we want to be more precise), but the proposed system would be wrong much less often. I am only talking about Devanagari, though scripts like Bangla/Gujrati/Gurmukhi may have similar needs. Breaking on ZWNJ seems sensible. -Manish On Fri, Apr 21, 2017 at 4:04 PM, Richard Wordingham via Unicodewrote: > On Thu, 20 Apr 2017 11:17:05 -0700 > Manish Goregaokar via Unicode wrote: > >> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode >> wrote: > >> > Is there consensus on how to count aksharas in the Devanagari >> > script? The doubts I have relate to a visible halant in >> > orthographic syllables other than the first. > >> I don't think there's consensus. > > I've found related discussion at > https://lists.w3.org/Archives/Public/public-i18n-indic/. The question > of how to count was raised and not answered there. > >> On Wed, Apr 19, 2017 at 4:35 PM, >> Richard Wordingham via Unicode wrote: >> > Is there consensus on how to count aksharas in the Devanagari >> > script? The doubts I have relate to a visible halant in >> > orthographic syllables other than the first. > >> I'm of the opinion that Unicode should start considering devanagari >> (and possibly other indic) consonant clusters as single extended >> grapheme clusters. 
> > Do Hindi speakers really think of orthographic syllables as characters? > > What may be useful is the concept of a definition of an orthographic > syllable. It may be possible to get the information from a font - > depending on the renderer - but a locale-dependent definition should be > possible for use as a fall-back. Devanagari rules won't work for > Tamil, and I think rules for Hindi and Nepali will be slightly > different - looks like a problem. > > The concept is possibly not useful in some Indic scripts - the concept > won't work well in Thai, but will work in Pali in the Thai script, for > both Pali orthographies. > > Richard.
Re: Counting Devanagari Aksharas
On Thu, 20 Apr 2017 11:17:05 -0700 Manish Goregaokar via Unicodewrote: > On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode > wrote: > > Is there consensus on how to count aksharas in the Devanagari > > script? The doubts I have relate to a visible halant in > > orthographic syllables other than the first. > I don't think there's consensus. I've found related discussion at https://lists.w3.org/Archives/Public/public-i18n-indic/. The question of how to count was raised and not answered there. > On Wed, Apr 19, 2017 at 4:35 PM, > Richard Wordingham via Unicode wrote: > > Is there consensus on how to count aksharas in the Devanagari > > script? The doubts I have relate to a visible halant in > > orthographic syllables other than the first. > I'm of the opinion that Unicode should start considering devanagari > (and possibly other indic) consonant clusters as single extended > grapheme clusters. Do Hindi speakers really think of orthographic syllables as characters? What may be useful is the concept of a definition of an orthographic syllable. It may be possible to get the information from a font - depending on the renderer - but a locale-dependent definition should be possible for use as a fall-back. Devanagari rules won't work for Tamil, and I think rules for Hindi and Nepali will be slightly different - looks like a problem. The concept is possibly not useful in some Indic scripts - the concept won't work well in Thai, but will work in Pali in the Thai script, for both Pali orthographies. Richard.
Re: Counting Devanagari Aksharas
That seems like a relatively niche use case (especially with Vedic Sanskrit) compared to having weird selection for everything else. I'm not convinced. When I use a romanized Devanagari input method (I typically do on my laptop), deleting the whole cluster is necessary anyway for things to work well. Direct input methods do let you edit in a more granular way but I've never seen the need for that. I guess this boils down to a matter of opinion and anecdotal experience, so there's not much I can do to convince this list otherwise :) -Manish On Fri, Apr 21, 2017 at 12:23 AM, Richard Wordingham via Unicodewrote: > On Fri, 21 Apr 2017 00:08:24 -0500 > Anshuman Pandey via Unicode wrote: > >> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode >> > wrote: > >> > Now imagine you're >> > typing Vedic Sanskrit, with its clusters and pitch indicators. > >> I tried typing Vedic Sanskrit, and it seems to work: > >> http://pandey.pythonanywhere.com/devsyll > > That should demonstrate nothing relevant if you type correctly first > time. The issue comes when you mistype and have to correct, to give > the usual worst case, the first letter of a conjunct. Now, I looked at > your page in Firefox on Ubuntu, and I found the cursor seemed to move > by extended grapheme cluster. That means that to change a consonant > you have to retype the following marks. > > I did find two issues with your analyser. > > Firstly, it broke श्रीमान्को into श्री·मा·न्को, which does not > concatenate back to the original. > > Secondly, you have a problem with ANUDATTA. You are not accepting > as a syllable. Perhaps you believed > https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm > as to the structure of a Devanagari syllable. I suspect ANUDATTA as a > consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the > like came in. > > Richard. >
Re: Counting Devanagari Aksharas
On Fri, 21 Apr 2017 00:08:24 -0500
Anshuman Pandey via Unicode wrote:

> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode
> > wrote:

> > Now imagine you're
> > typing Vedic Sanskrit, with its clusters and pitch indicators.

> I tried typing Vedic Sanskrit, and it seems to work:
> http://pandey.pythonanywhere.com/devsyll

That should demonstrate nothing relevant if you type correctly first time. The issue comes when you mistype and have to correct, to give the usual worst case, the first letter of a conjunct. Now, I looked at your page in Firefox on Ubuntu, and I found the cursor seemed to move by extended grapheme cluster. That means that to change a consonant you have to retype the following marks.

I did find two issues with your analyser.

Firstly, it broke श्रीमान्को into श्री·मा·न्को, which does not concatenate back to the original.

Secondly, you have a problem with ANUDATTA. You are not accepting as a syllable. Perhaps you believed https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm as to the structure of a Devanagari syllable. I suspect ANUDATTA as a consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the like came in.

Richard.
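The first issue above is a violation of a simple invariant: the segments of a word should concatenate back to the original. A small Python check along these lines can flag it. The helper name `roundtrip_loss` is made up for illustration, and the example assumes that the code point lost by the analyser was the ZWNJ; the archived text does not show which one it was.

```python
import difflib

def roundtrip_loss(original, segments):
    """Return the code points of `original` that are missing or altered
    in the concatenated segments, as 'U+XXXX' strings ([] if lossless)."""
    joined = "".join(segments)
    lost = []
    matcher = difflib.SequenceMatcher(None, original, joined, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            lost.extend(f"U+{ord(ch):04X}" for ch in original[i1:i2])
    return lost

# Nepali spelling with ZWNJ after the virama (assumed input).
original = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
# Hypothetical analyser output with the ZWNJ silently dropped.
segments = ["\u0936\u094D\u0930\u0940", "\u092E\u093E", "\u0928\u094D\u0915\u094B"]

lost = roundtrip_loss(original, segments)
```

Running such a check over a test corpus is a cheap way to catch a segmenter that normalises or drops invisible characters as a side effect.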
Re: Counting Devanagari Aksharas
> On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode >wrote: > > On Thu, 20 Apr 2017 14:14:00 -0700 > Manish Goregaokar via Unicode wrote: > >> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode >> wrote: > >>> On Thu, 20 Apr 2017 11:17:05 -0700 >>> Manish Goregaokar via Unicode wrote: > I'm of the opinion that Unicode should start considering devanagari (and possibly other indic) consonant clusters as single extended grapheme clusters. > >>> You won't like it if cursor movement granularity is reduced to one >>> extended grapheme cluster. I'm grateful that Emacs allows me to > >> I mean, we do the same for Hangul. > > Hangul is generally a maximum of three characters, which is about the > border of tolerance. I find it irritating to have to completely retype > Thai grapheme clusters of consonant, vowel and tone mark. There were > loud protests from the Thais when preposed vowels were added to the > Thai grapheme cluster and implementations then responded, and Unicode > quickly removed them. Now imagine you're typing Vedic Sanskrit, with its > clusters and pitch indicators. I tried typing Vedic Sanskrit, and it seems to work: http://pandey.pythonanywhere.com/devsyll Haven't tried the orthographic oddity of the Nepali case in question. Above my pay grade. If you access the above link on an iOS device you'll see tofu and missing characters. Apple's Devanagari font needs to be fixed. - AP
Re: Counting Devanagari Aksharas
On Thu, 20 Apr 2017 14:14:00 -0700
Manish Goregaokar via Unicode wrote:

> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
> wrote:

> > On Thu, 20 Apr 2017 11:17:05 -0700
> > Manish Goregaokar via Unicode wrote:

> >> I'm of the opinion that Unicode should start considering devanagari
> >> (and possibly other indic) consonant clusters as single extended
> >> grapheme clusters.

> > You won't like it if cursor movement granularity is reduced to one
> > extended grapheme cluster. I'm grateful that Emacs allows me to

> I mean, we do the same for Hangul.

Hangul is generally a maximum of three characters, which is about the border of tolerance. I find it irritating to have to completely retype Thai grapheme clusters of consonant, vowel and tone mark. There were loud protests from the Thais when preposed vowels were added to the Thai grapheme cluster and implementations then responded, and Unicode quickly removed them. Now imagine you're typing Vedic Sanskrit, with its clusters and pitch indicators.

> The main time you need intra-conjunct segmentation in Devanagari is
> when deleting something you just typed.

You'll typically be several words beyond by the time you notice, or by the time a spell-checker spots a problem.

Richard.
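The "maximum of three characters" figure for Hangul can be checked for the precomposed modern syllables with Python's standard `unicodedata` module: canonical decomposition (NFD) turns each syllable into its conjoining jamo. This says nothing about Old Hangul sequences typed directly as jamo, which this sketch does not cover; `jamo_count` is just an illustrative helper name.

```python
import unicodedata

def jamo_count(syllable):
    """Number of conjoining jamo in the NFD form of one Hangul syllable."""
    return len(unicodedata.normalize("NFD", syllable))

open_syllable = jamo_count("\uAC00")    # GA: leading consonant + vowel
closed_syllable = jamo_count("\uD55C")  # HAN: leading consonant + vowel + trailing consonant

# Longest decomposition over the whole precomposed block U+AC00..U+D7A3.
longest = max(jamo_count(chr(cp)) for cp in range(0xAC00, 0xD7A4))
```

So a cursor that moves by whole Hangul syllable never forces the retyping of more than three code points, which is the contrast being drawn with long Devanagari conjuncts carrying vowel signs and Vedic pitch marks.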
Re: Counting Devanagari Aksharas
I mean, we do the same for Hangul.

The main time you need intra-conjunct segmentation in Devanagari is when deleting something you just typed. And backspace usually operates on code points anyway (except for some weird cases like flag emoji, though this isn't uniform across platforms). I don't see how intra-conjunct selection would be useful otherwise.

-Manish

On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode wrote:
> On Thu, 20 Apr 2017 11:17:05 -0700
> Manish Goregaokar via Unicode wrote:
>
>> When given a rendered representation people seem to uniformly count
>> conjuncts as multiple aksharas if rendered with visible halant, and as
>> a single akshara if they are rendered conjoined.
>
> Now, that's what I expected.
>
>> I'm of the opinion that Unicode should start considering devanagari
>> (and possibly other indic) consonant clusters as single extended
>> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
>> but sometimes family emoji will not render as a single glyph either
>> (if you use skin tones or more than 4 family members) and we still
>> consider those EGCs.
>
> You won't like it if cursor movement granularity is reduced to one
> extended grapheme cluster. I'm grateful that Emacs allows me to
> delete and replace the first NFC character of a grapheme cluster.
>
> Richard.
Re: Counting Devanagari Aksharas
On Thu, 20 Apr 2017 11:17:05 -0700
Manish Goregaokar via Unicode wrote:

> When given a rendered representation people seem to uniformly count
> conjuncts as multiple aksharas if rendered with visible halant, and as
> a single akshara if they are rendered conjoined.

Now, that's what I expected.

> I'm of the opinion that Unicode should start considering devanagari
> (and possibly other indic) consonant clusters as single extended
> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
> but sometimes family emoji will not render as a single glyph either
> (if you use skin tones or more than 4 family members) and we still
> consider those EGCs.

You won't like it if cursor movement granularity is reduced to one extended grapheme cluster. I'm grateful that Emacs allows me to delete and replace the first NFC character of a grapheme cluster.

Richard.
Re: Counting Devanagari Aksharas
On Thu, 20 Apr 2017 15:33:37 +0530 Shriramana Sharma via Unicodewrote: > All I can say is that Tamil script has eschewed most consonant cluster > ligatures/conjoining forms. As for Devanagari, writing श्रीमान्को (I > used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology. > The latter would be Sanskrit orthography and former perhaps Hindi, > although I wouldn't know why anyone would want to run in the को with > the preceding श्रीमान् even in Hindi. According to p23 of http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, it's Nepali. It's a compromise between श्रीमान्को and Hindi-style श्रीमान् को. > And IMO it would be better to > clearly define at the outset what you meant by "akshara" in your > question to avoid confusions by people replying having a different > idea of the meaning of that term. I didn't want to be any more precise than "orthographic syllable". Swaran Lata is urging, in submission http://www.unicode.org/L2/L2017/17094-indic-text-seg.pdf to the UTC, that UAX#29 "Unicode Text Segmentation" adopt a rather naïve definition of an Indian orthographic syllable. The worst outcome in my opinion would be if it were adopted for the extended grapheme cluster definition - it would make editing orthographic clusters even more difficult. However, it would make sense for CLDR to carry localised definitions. For layout, the definition would be relevant for 'drop capital effects' and for the analogue of inserting spaces between letters. There are recommendations in a maturing W3C specification for Indic layout, though to be fair the specification fairly quickly restricts its scope to Indian scripts. Now, if the spacing were applied to the Nepali word श्रीमान्को I would expect to see something like श्री मा न् को, as the base word itself would appear as श्री मा न् when subjected to the same treatment. 
However, before suggesting minor improvements that might be in order, I thought I should check whether there was agreement that terminated an orthographic syllable. It now seems that any general agreement would in fact be that it did *not* terminate an orthographic syllable! I must say that stretching श्रीमान्को out as श्री मा न्को feels wrong. If my feeling is right, then the definition of orthographic syllable, if it can be done without reference to a font, belongs in CLDR, as UAX#29 implies, and not in the Unicode Character Database and Unicode standards. Richard.
Re: Counting Devanagari Aksharas
I don't think there's consensus. When given a rendered representation people seem to uniformly count conjuncts as multiple aksharas if rendered with visible halant, and as a single akshara if they are rendered conjoined. Most fonts for devanagari these days are pretty good at conjoining consonants. They seem to do so for all common conjuncts, and usually for most practical (i.e. not ridiculously long) conjuncts. I've never seen a visible halant in text I've read. I'm of the opinion that Unicode should start considering devanagari (and possibly other indic) consonant clusters as single extended grapheme clusters. Yes, sometimes it's not rendered as a single glyph, but sometimes family emoji will not render as a single glyph either (if you use skin tones or more than 4 family members) and we still consider those EGCs. -Manish On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicodewrote: > Is there consensus on how to count aksharas in the Devanagari script? > The doubts I have relate to a visible halant in orthographic syllables > other than the first. > > For example, according to 'Devanagari VIP Team Issues Report' > http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a > derived form from Nepali श्रीमान् should be written श्रीमान्को > DEVANAGARI LETTER RA, U+0940 DEVANAGARI VOWEL SIGN II, U+092E > DEVANAGARI LETTER MA, U+093E DEVANAGARI VOWEL SIGN AA, U+0928 > DEVANAGARI LETTER NA, U+094D, U+200C ZERO WIDTH NON-JOINER, U+0915 > DEVANAGARI LETTER KA, U+094B DEVANAGARI VOWEL SIGN O> and not > श्रीमान्को U+094D, U+0915, U+094B>. Now, if the font used has a conjunct for > SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO > and the latter as having 3 aksharas SH.RII, MAA, N.KO. > > If the font leads to the use of a visible halant instead of the vattu > conjunct SH.RA, as happens when I view this email, would there then be > 5 and 4 aksharas respectively? 
A further complication is that the font > chosen treats what looks like SH, RA as a conjunct; the vowel I appears > to the left of SH when added after RA (श्रि). > > Richard. >
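The code-point-only count in the original question (4 aksharas with ZWNJ, 3 without) can be reproduced with a naive Python sketch. It assumes the rule under discussion: a consonant directly after a virama joins the current akshara, and a ZWNJ after the virama blocks that. The consonant ranges are a simplification, `split_aksharas` is a hypothetical name, and nothing here models what a font actually renders, so the visible-halant cases (5 and 4) are out of its reach.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"
ZWJ = "\u200D"

def is_consonant(ch):
    # Simplified: main consonant block plus the nukta-form additions.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"

def split_aksharas(text):
    """Naive font-independent split into orthographic syllables."""
    clusters = []
    prev = ""
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc") or ch in (ZWNJ, ZWJ)
        joins = prev == VIRAMA and is_consonant(ch)
        if clusters and (is_mark or joins):
            clusters[-1] += ch   # extend the current akshara
        else:
            clusters.append(ch)  # start a new akshara
        prev = ch
    return clusters

# SHA, VIRAMA, RA, II, MA, AA, NA, VIRAMA, [ZWNJ,] KA, O
with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = with_zwnj.replace(ZWNJ, "")

nepali = split_aksharas(with_zwnj)        # SH.RII, MAA, N, KO
sanskrit = split_aksharas(without_zwnj)   # SH.RII, MAA, N.KO
spaced = " ".join(nepali)                 # the stretched-out form of the Nepali spelling
```

The split keeps every code point, so joining the pieces gives back the input unchanged.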
Re: Counting Devanagari Aksharas
Hello Richard. Yes my earlier reply wasn't intended to be offlist. I have near-zero knowledge about non-Indic languages. All I can say is that Tamil script has eschewed most consonant cluster ligatures/conjoining forms. As for Devanagari, writing श्रीमान्को (I used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology. The latter would be Sanskrit orthography and former perhaps Hindi, although I wouldn't know why anyone would want to run in the को with the preceding श्रीमान् even in Hindi. And IMO it would be better to clearly define at the outset what you meant by "akshara" in your question to avoid confusions by people replying having a different idea of the meaning of that term. -- Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
Re: Counting Devanagari Aksharas
I was offered the following reply:

> To my knowledge except in Tamil script vowel less consonants in
> written form aren't considered as separate "akshara"s in native
> terminology.

Word-finally they seem to be being treated as such. To be more precise, a final cluster of one or more consonants marked as having no vowel is - Sanskrit has a few word-final clusters.

> However for text shaping purposes they will surely have
> to be considered as separate orthographic syllables in Unicode
> terminology since in word end position they can sometimes carry svara
> markers.

The complication comes word internally. My understanding is that phonetically syllable-final consonants in non-Indic words in non-Indic languages have a tendency not to be included in an akshara along with the start of the next syllable. However, that tendency is more evident in scripts other than Devanagari; Devanagari has developed in the context of Indic languages.

Renderers' syllable-recognition algorithms will naturally treat word-final devowelled sequences as separate units, rather than associate them with the previous implicit or explicit vowel. Burmese is a good example of what can happen with a non-Indic language; in native words, phonetic syllabic boundaries tend to be orthographic syllable boundaries.

Text-shaping engines like Microsoft's Uniscribe are more complicated. For scripts with a virama, they seem to assume that the virama may be a combining operator, and wait for data from the font to decide how many clusters to form.

One test is the insertion of white spaces in a word when it is stretched out. Of course, that test can only be applied where human decisions are involved - otherwise we are just looking at what dominant renderers are actually doing, rather than looking at what they ought to be doing.

Richard.