Re: Counting Devanagari Aksharas

2017-04-26 Thread Eli Zaretskii via Unicode
> Date: Wed, 26 Apr 2017 07:45:07 +0100
> From: Richard Wordingham via Unicode 
> 
> On Wed, 26 Apr 2017 08:48:13 +0300
> Eli Zaretskii via Unicode  wrote:
> 
> > > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > > From: Richard Wordingham 
> > > Cc: Eli Zaretskii 
> > > 
> > > If I search for CGJ, highlighting it is frequently supremely
> > > useless. I want to know where it is; highlighting is merely a tool
> > > to find it on the screen.  
> > 
> > So I guess this means highlighting is useful after all ;-)
> 
> ᩺Not if the area highlit is zero pixels wide.

If you elide too much of the context, the discussion could lose all of
its meaning.  Let me restore some of the relevant context:

> > > > > On 2017-04-22, Eli Zaretskii via Unicode  wrote:
> > > > 
> > > > > > I could imagine Emacs decomposing characters temporarily when only
> > > > > > part of a cluster matches the search string.  Assuming this would
> > > > > > make sense to users of some complex scripts, that is.  You are
> > > > > > welcome to suggest such a feature by using report-emacs-bug.  
> > > > 
> > > > The cursor moves to the cluster boundary, so there is much less of a
> > > > problem with Emacs.
> > > 
> > > But you wanted to highlight only part of the cluster, AFAIU.
> > 
> > If I search for CGJ, highlighting it is frequently supremely useless.
> > I want to know where it is; highlighting is merely a tool to find it on
> > the screen.
> 
> So I guess this means highlighting is useful after all ;-)

IOW, the context was a suggestion to temporarily disable character
composition, in which case CGJ _will_ be displayed as non-zero width
glyph, at least in the default Emacs display configuration, and CGJ
_will_ be visible with its highlight.


Re: Counting Devanagari Aksharas

2017-04-26 Thread Richard Wordingham via Unicode
On Wed, 26 Apr 2017 08:48:13 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > From: Richard Wordingham 
> > Cc: Eli Zaretskii 
> > 
> > If I search for CGJ, highlighting it is frequently supremely
> > useless. I want to know where it is; highlighting is merely a tool
> > to find it on the screen.  
> 
> So I guess this means highlighting is useful after all ;-)

᩺Not if the area highlit is zero pixels wide.

Richard.



Re: Counting Devanagari Aksharas

2017-04-25 Thread Eli Zaretskii via Unicode
> Date: Sun, 23 Apr 2017 22:59:49 +0100
> From: Richard Wordingham 
> Cc: Eli Zaretskii 
> 
> If I search for CGJ, highlighting it is frequently supremely useless.
> I want to know where it is; highlighting is merely a tool to find it on
> the screen.

So I guess this means highlighting is useful after all ;-)


Re: Go romanize! Re: Counting Devanagari Aksharas

2017-04-25 Thread Naena Guru via Unicode

Quote from below:

The word indeed means 'danger' (Pali/Sanskrit _antarāya_).  The
pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai
Tham script no longer have /r/.  The older sequence /tr/ normally
became /tʰ/ (except in Lao), but the spelling has not been updated - at
least, not amongst the more literate.  The script has a special symbol
for the short vowel /o/, which it shares with the Lao script.  This
symbol is used in writing that word.  Two ways I have seen it spelt,
each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second
syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy.  I have also seen a
form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy.
However, I have seen nothing that shows that I won't encounter
ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even
ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Response:
Perhaps this word is derived from Sanskrit 'anþaraða'
(Search: antarada at 
http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche)

Sinhala: anþaraaðaayakayi, anþaraava, anþaraavayi, anþraava, anþraavayi.
Use this font to read the above Sinhala words: http://smartfonts.net/ttf/aruna.ttf


-=- svasþi siððham! -=-


On 4/25/2017 2:07 AM, Richard Wordingham via Unicode wrote:


On Mon, 24 Apr 2017 20:53:12 +0530
Naena Guru via Unicode  wrote:


Quote by Richard:
Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For
example, there are several different ways of writing what one might
naively record as "ontarAy".

MY RESPONSE:
Richard, I stuck to the two specifications (Unicode and Font) and
Sanskrit grammar. The akSara has two aspects, its sound (zabða,
phoneme) and its shape. (letter, ruupa). Reduce the writing system to
its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
them (ruupa). SBCS provides the best technical facilities for any
language. (This is why now more than 130 languages romanize despite
Unicode). Use English letters for similar sounds in the native
speech. Now, treat all combinations as ligatures. For example, 'po'
sound in Indic has the p consonant with a sign ahead plus a sign
after.

In many Indic scripts, yes.  In Devanagari, the vowel sign is normally
a single element classified as following the consonant.  In Thai, the
vowel sign precedes the consonant.  Tai Tham uses both a two-part sign
and a preceding sign.  The preceding sign is for Tai words and the
two-part sign for Pali words, but loanwords from Pali into the Tai
languages may retain the two part sign.


For the font, there is no difference between the way it makes
the combination 'ä', which has a sign above and the Indic having two
on either side.

For OpenType, there is.  The first can be made by providing a
simple table of where the diaeresis goes relative to the base
characters, in this case the 'a'.  The second is harder, for the 'p'
may have other marks attached to it, so doing it by relative
positioning is painfully complicated and error-prone.
This job is given to the rendering engine, which may introduce its own
problems.

AAT and Graphite offer the font maker the ability to move the 'sign
ahead' from after the 'p' to before it.


Recall that long ago, Unicode stopped defining fixed
ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false.  Not
every glyph has to be mapped to by a single character.  I don't do that
for contextual forms or ligatures in my font.


Spelling and speech:
There is indeed a confusion about writing and reading in Hindi, as I
have observed. Like in English and Tamil, Hindi tends to end words
with a consonant. So, there is this habit among the Hindi speakers to
drop the ending vowel, mostly 'a' from words that actually end with
it. For example, the famous name Jayantha (miserable mine too, haha!
= jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It
is a Sanskrit word. Sanskrit and languages like Sinhala have vowel
endings and are traditionally spoken as such.

This loss is also to be found in Further India.  Thai, Lao and Khmer
now require that such a word-final vowel be written explicitly if it is
still pronounced.


Looking at the word you gave, ontarAy, it looks to me like an
Anglicized form. If I am to make a guess, its ending is like in
ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I
am right, this is a good example of the decline of a writing system owing
to bad, uncaring application of technology. We are in the Digital
Age, and we need not compromise any more. In fact, we can fix errors
and decadence introduced by past technologies.

The word 

Re: Go romanize! Re: Counting Devanagari Aksharas

2017-04-24 Thread Richard Wordingham via Unicode
On Mon, 24 Apr 2017 20:53:12 +0530
Naena Guru via Unicode  wrote:

> Quote by Richard:
> Unless this implies a spelling reform for many languages, I'd like to
> see how this works for the Tai Tham script.  I'm not happy with the
> Romanisation I use to work round hostile rendering engines.  (My
> scheme is only documented in variable hack_ss02 in the last script
> blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For
> example, there are several different ways of writing what one might
> naively record as "ontarAy".
> 
> MY RESPONSE:
> Richard, I stuck to the two specifications (Unicode and Font) and
> Sanskrit grammar. The akSara has two aspects, its sound (zabða,
> phoneme) and its shape. (letter, ruupa). Reduce the writing system to
> its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
> them (ruupa). SBCS provides the best technical facilities for any
> language. (This is why now more than 130 languages romanize despite
> Unicode). Use English letters for similar sounds in the native
> speech. Now, treat all combinations as ligatures. For example, 'po'
> sound in Indic has the p consonant with a sign ahead plus a sign
> after.

In many Indic scripts, yes.  In Devanagari, the vowel sign is normally
a single element classified as following the consonant.  In Thai, the
vowel sign precedes the consonant.  Tai Tham uses both a two-part sign
and a preceding sign.  The preceding sign is for Tai words and the
two-part sign for Pali words, but loanwords from Pali into the Tai
languages may retain the two part sign.

> For the font, there is no difference between the way it makes
> the combination 'ä', which has a sign above and the Indic having two
> on either side.

For OpenType, there is.  The first can be made by providing a
simple table of where the diaeresis goes relative to the base
characters, in this case the 'a'.  The second is harder, for the 'p'
may have other marks attached to it, so doing it by relative
positioning is painfully complicated and error-prone.
This job is given to the rendering engine, which may introduce its own
problems.

AAT and Graphite offer the font maker the ability to move the 'sign
ahead' from after the 'p' to before it.

> Recall that long ago, Unicode stopped defining fixed
> ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false.  Not
every glyph has to be mapped to by a single character.  I don't do that
for contextual forms or ligatures in my font.

> Spelling and speech:
> There is indeed a confusion about writing and reading in Hindi, as I
> have observed. Like in English and Tamil, Hindi tends to end words
> with a consonant. So, there is this habit among the Hindi speakers to
> drop the ending vowel, mostly 'a' from words that actually end with
> it. For example, the famous name Jayantha (miserable mine too, haha!
> = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It
> is a Sanskrit word. Sanskrit and languages like Sinhala have vowel
> endings and are traditionally spoken as such.

This loss is also to be found in Further India.  Thai, Lao and Khmer
now require that such a word-final vowel be written explicitly if it is
still pronounced.

> Looking at the word you gave, ontarAy, it looks to me like an
> Anglicized form. If I am to make a guess, its ending is like in
> ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I
> am right, this is a good example of the decline of a writing system owing
> to bad, uncaring application of technology. We are in the Digital
> Age, and we need not compromise any more. In fact, we can fix errors
> and decadence introduced by past technologies.

The word indeed means 'danger' (Pali/Sanskrit _antarāya_).  The
pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai
Tham script no longer have /r/.  The older sequence /tr/ normally
became /tʰ/ (except in Lao), but the spelling has not been updated - at
least, not amongst the more literate.  The script has a special symbol
for the short vowel /o/, which it shares with the Lao script.  This
symbol is used in writing that word.  Two ways I have seen it spelt,
each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second
syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy.  I have also seen a
form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy.
However, I have seen nothing that shows that I won't encounter
ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even
ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Richard.



Go romanize! Re: Counting Devanagari Aksharas

2017-04-24 Thread Naena Guru via Unicode

Quote by Richard:
Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".

MY RESPONSE:
Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit 
grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape. 
(letter, ruupa). Reduce the writing system to its consonants, vowels etc. 
(zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best 
technical facilities for any language. (This is why now more than 130 languages 
romanize despite Unicode). Use English letters for similar sounds in the native 
speech. Now, treat all combinations as ligatures. For example, 'po' sound in 
Indic has the p consonant with a sign ahead plus a sign after. For the font, 
there is no difference between the way it makes the combination 'ä', which has 
a sign above and the Indic having two on either side. Recall that long ago, 
Unicode stopped defining fixed ligatures and asked the font makers to define 
them in the PUA.

Spelling and speech:
There is indeed a confusion about writing and reading in Hindi, as I have 
observed. Like in English and Tamil, Hindi tends to end words with a consonant. 
So, there is this habit among the Hindi speakers to drop the ending vowel, 
mostly 'a' from words that actually end with it. For example, the famous name 
Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced 
Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like 
Sinhala have vowel endings and are traditionally spoken as such.

Dictionary is a commercial invention. When Caxton brought lead types to 
England, French-speaking Latin-flaunting elites did not care about the poor 
natives. Earlier, invading Romans forced them to drop Fuþark and adopt the 
22-letter Latin alphabet. So, they improvised. Struck a line across d and made 
ð, Eth; added a sign to 'a' and made æ (Asc) and continued using Thorn (þ) by 
rounding the loop. Lead type printing hit English for the second time, ruining 
it as the spell standardizing began. Dictionaries sold. THE POWERFUL CAN RUIN 
PEOPLE'S PROPERTY BECAUSE THEY CAN IN ORDER TO MAKE MONEY. Unicode enthusiasts, 
take heed!

Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. 
If I am to make a guess, its ending is like in ontarAyi. Is it said something 
like, own-the-raa-yi? (danger?) If I am right, this is a good example of 
decline of a writing system owing to bad, uncaring application of technology.
We are in the Digital Age, and we need not compromise any more. In fact, we can 
fix errors and decadence introduced by past technologies.


RICHARD:
That sounds like a letter-assembly system.

MY RESPONSE:
Nothing assembled there, my friend.



On 4/24/2017 12:38 PM, Richard Wordingham via Unicode wrote:

On Mon, 24 Apr 2017 00:36:26 +0530
Naena Guru via Unicode  wrote:


The Unicode approach to Sanskrit and all Indic is flawed. Indic
should not be letter-assembly systems.

Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
the speech. Each writing system then assigns a shape to the
phonetically precise phoneme.

The most technically and grammatically proper solution for Indic is
first to ROMANIZE the group of writing systems at the level of
phonemes. That is, assign romanized shapes to vowels, consonants,
prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
allophones) etc. This approach is similar to how European languages
picked up Latin, improvised the script and even uses Simples and
Capitals repertoire. Romanizing immediately makes typing easier and
eliminates sometimes embarrassing ambiguity in Anglicizing -- you
type phonetically on key layouts close to QWERTY. (Only four
positions are different in Romanized Sinhala layout).

If we drop the capitalizing rules and utilize caps to indicate the
'other' forms of a common letter, we get an intuitively typed system
for each language, and readable too. When this is done carefully,
comparing phoneme sets of the languages, we can reach a common set of
Latin-derived SINGLE-BYTE letters completely covering all phonemes of
all Indic.

Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".


Next, each native script can be obtained by making 

Re: Counting Devanagari Aksharas

2017-04-24 Thread Richard Wordingham via Unicode
On Mon, 24 Apr 2017 00:36:26 +0530
Naena Guru via Unicode  wrote:

> The Unicode approach to Sanskrit and all Indic is flawed. Indic
> should not be letter-assembly systems.
> 
> Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
> the speech. Each writing system then assigns a shape to the
> phonetically precise phoneme.
> 
> The most technically and grammatically proper solution for Indic is 
> first to ROMANIZE the group of writing systems at the level of
> phonemes. That is, assign romanized shapes to vowels, consonants,
> prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
> allophones) etc. This approach is similar to how European languages
> picked up Latin, improvised the script and even uses Simples and
> Capitals repertoire. Romanizing immediately makes typing easier and
> eliminates sometimes embarrassing ambiguity in Anglicizing -- you
> type phonetically on key layouts close to QWERTY. (Only four
> positions are different in Romanized Sinhala layout).
> 
> If we drop the capitalizing rules and utilize caps to indicate the 
> 'other' forms of a common letter, we get an intuitively typed system
> for each language, and readable too. When this is done carefully,
> comparing phoneme sets of the languages, we can reach a common set of 
> Latin-derived SINGLE-BYTE letters completely covering all phonemes of 
> all Indic.

Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".

> Next, each native script can be obtained by making orthographic smart 
> fonts that display the SBCS codes in the respective shapes of the
> native scripts.

That sounds like a letter-assembly system.

So how does your scheme help one split words into orthographic
syllables?

> I have successfully romanized Sinhala and revived the full repertoire
> of Sinhala + Sanskrit orthography losing nothing. Sinhala script is
> perhaps the most complex of all Indic because it is used to write
> both Sanskrit and Pali.

What complication does Pali impose on top of Sanskrit?  As far as I'm
aware, it just needs one extra letter, usually called LLA, which you
will already have if 'Sanskrit' includes Vedic Sanskrit.
 
> See this: http://ahangama.com/ (It's all SBCS underneath).
> Test here: http://ahangama.com/edit.htm

All I get for these are blank pages.  Perhaps there's an unreported
communication failure in the network.

Richard.


Re: Counting Devanagari Aksharas

2017-04-23 Thread Richard Wordingham via Unicode
On Sun, 23 Apr 2017 05:40:29 +0300
Eli Zaretskii via Unicode  wrote:

> > The cursor moves to the cluster boundary, so there is much less of a
> > problem with Emacs.  
> 
> But you wanted to highlight only part of the cluster, AFAIU.

If I search for CGJ, highlighting it is frequently supremely useless.
I want to know where it is; highlighting is merely a tool to find it on
the screen.

Richard.
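
A minimal sketch of the "I want to know where it is" requirement, independent of any editor: rather than highlighting a zero-width character, report its code point offset and name its neighbours. The function name describe_matches is hypothetical and only for illustration; it uses nothing beyond Python's standard unicodedata module, and it is not how Emacs or any other editor implements search.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def describe_matches(text, needle=CGJ):
    """Return (offset, name_before, name_after) for each occurrence of needle.

    Offsets are code point indices into text. The neighbours are named so
    that an invisible match can still be located by the reader.
    """
    results = []
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        before = (unicodedata.name(text[start - 1], "<unnamed>")
                  if start > 0 else "<start of text>")
        after = (unicodedata.name(text[end], "<unnamed>")
                 if end < len(text) else "<end of text>")
        results.append((start, before, after))
        start = text.find(needle, start + 1)
    return results


for offset, before, after in describe_matches("a\u034Fb"):
    print(f"CGJ at offset {offset}: after {before}, before {after}")
```

A report of this kind sidesteps the zero-pixel highlight, at the cost of sending the reader from the rendered text to a list of offsets.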


Re: Counting Devanagari Aksharas

2017-04-23 Thread Naena Guru via Unicode
The Unicode approach to Sanskrit and all Indic is flawed. Indic should 
not be letter-assembly systems.


Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of the 
speech. Each writing system then assigns a shape to the phonetically 
precise phoneme.


The most technically and grammatically proper solution for Indic is 
first to ROMANIZE the group of writing systems at the level of phonemes. 
That is, assign romanized shapes to vowels, consonants, prenasals, 
post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. 
This approach is similar to how European languages picked up Latin, 
improvised the script and even uses Simples and Capitals repertoire. 
Romanizing immediately makes typing easier and eliminates sometimes 
embarrassing ambiguity in Anglicizing -- you type phonetically on key 
layouts close to QWERTY. (Only four positions are different in Romanized 
Sinhala layout).


If we drop the capitalizing rules and utilize caps to indicate the 
'other' forms of a common letter, we get an intuitively typed system for 
each language, and readable too. When this is done carefully, comparing 
phoneme sets of the languages, we can reach a common set of 
Latin-derived SINGLE-BYTE letters completely covering all phonemes of 
all Indic.


Next, each native script can be obtained by making orthographic smart 
fonts that display the SBCS codes in the respective shapes of the native 
scripts.


I have successfully romanized Sinhala and revived the full repertoire of 
Sinhala + Sanskrit orthography losing nothing. Sinhala script is perhaps
the most complex of all Indic because it is used to write both Sanskrit 
and Pali.


See this: http://ahangama.com/ (It's all SBCS underneath).
Test here: http://ahangama.com/edit.htm


On 4/20/2017 5:05 AM, Richard Wordingham via Unicode wrote:

Is there consensus on how to count aksharas in the Devanagari script?
The doubts I have relate to a visible halant in orthographic syllables
other than the first.

For example, according to 'Devanagari VIP Team Issues Report'
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
derived form from Nepali श्रीमान्  should be written श्रीमान्‌को
and not श्रीमान्को.  Now, if the font used has a conjunct for
SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
and the latter as having 3 aksharas SH.RII, MAA, N.KO.

If the font leads to the use of a visible halant instead of the vattu
conjunct SH.RA, as happens when I view this email, would there then be
5 and 4 aksharas respectively?  A further complication is that the font
chosen treats what looks like SH, RA as a conjunct; the vowel I appears
to the left of SH when added after RA (श्रि).

Richard.
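
The two counts in the question above can be reproduced with a small sketch. It rests on assumptions that are not part of the original message: every virama followed directly by a consonant (optionally with ZWJ in between) is taken to form a conjunct, a virama followed by ZWNJ or anything else ends the akshara, and only the basic consonant and independent-vowel ranges are recognised. The function names are hypothetical. Because the sketch looks only at the encoded text, it cannot answer the font-dependent part of the question (a visible halant chosen by the font).

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER; ZWNJ (U+200C) falls into the "else" case


def is_consonant(ch):
    # KA..HA plus the precomposed nukta consonants QA..YYA (simplified).
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_independent_vowel(ch):
    # SHORT A..AU only; later additions to the block are ignored here.
    return "\u0904" <= ch <= "\u0914"


def count_aksharas(text):
    """Count orthographic syllables under the stated assumptions."""
    count = 0
    joined = False  # True while a virama may still conjoin the next consonant
    for ch in text:
        if is_consonant(ch):
            if not joined:
                count += 1  # this consonant starts a new akshara
            joined = False
        elif ch == VIRAMA:
            joined = True
        elif ch == ZWJ:
            pass  # ZWJ does not break a pending conjunct
        elif is_independent_vowel(ch):
            count += 1
            joined = False
        else:
            joined = False  # ZWNJ, vowel signs, other marks, spaces, ...
    return count


with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u0915\u094B"
print(count_aksharas(with_zwnj), count_aksharas(without_zwnj))  # 4 3
```

The first string is the ZWNJ spelling and gives 4 (SH.RII, MAA, N, KO); the second gives 3 (SH.RII, MAA, N.KO).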





Re: Counting Devanagari Aksharas

2017-04-23 Thread Asmus Freytag via Unicode

  
  
On 4/22/2017 9:25 PM, Manish Goregaokar via Unicode wrote:

> Backspace in browsers (chrome and firefox) deletes within EGCs too.
> They delete matras in devanagari, and jamos in hangul. They don't
> *exactly* work off of code points (e.g. flag emoji gets deleted as a
> whole in many backspace implementations)

Flag emoji and many other "invisible" sequences are different from
ligatures and conjuncts in one important way: their elements are not
usually key strokes, but the full sequence would be inserted from a
pick list or other type of input method. If you didn't "type" each of
the elements of the sequence, then deleting individual ones is
something you would only need for debugging or other specialized
purposes, not for undoing a physical action (keystroke) in reverse
order.

Speaking of undoing: not all editors always support full key-stroke by
key-stroke undo, some will coalesce longer runs of text. This saves on
space for the undo buffer, but also makes undoing more extensive edits
less painful. It's clearly a personal preference whether such
"streamlining" would feel "right" or "bothersome".

Beyond the last line typed, or two, I may really not care if undo went
word by word, say.

A./

  



Re: Counting Devanagari Aksharas

2017-04-22 Thread Manish Goregaokar via Unicode
> You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor.

Yes, and this is one of the reasons it feels broken in devanagari: you
get cursors in the midst of aksharas, in weird places.


Backspace in browsers (chrome and firefox) deletes within EGCs too.
They delete matras in devanagari, and jamos in hangul. They don't
*exactly* work off of code points (e.g. flag emoji gets deleted as a
whole in many backspace implementations)
-Manish
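
The difference between the two backspace behaviours can be sketched as follows. This is an illustration under loud assumptions, not the browsers' actual logic: the "cluster" here is only a base character plus trailing combining marks (general category M*), with pairs of regional indicators treated as one flag. It is not a full UAX #29 implementation and does not handle Hangul jamo or emoji ZWJ sequences. Both function names are hypothetical.

```python
import unicodedata


def is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def backspace_code_point(text):
    """Delete exactly one code point from the end."""
    return text[:-1]


def backspace_cluster(text):
    """Delete one simplified cluster (base + marks, or a flag pair)."""
    if not text:
        return text
    i = len(text) - 1
    if is_regional_indicator(text[i]):
        # Flags pair up from the start of a run of regional indicators,
        # so an odd run ends in an unpaired indicator.
        run = 0
        j = i
        while j >= 0 and is_regional_indicator(text[j]):
            run += 1
            j -= 1
        return text[:-2] if run % 2 == 0 else text[:-1]
    # Step back over combining marks to reach the base character.
    while i > 0 and unicodedata.category(text[i]).startswith("M"):
        i -= 1
    return text[:i]


ki = "\u0915\u093F"            # KA + VOWEL SIGN I
flag = "\U0001F1EE\U0001F1F3"  # regional indicators I + N
print(len(backspace_code_point(ki)), len(backspace_cluster(ki)))      # 1 0
print(len(backspace_code_point(flag)), len(backspace_cluster(flag)))  # 1 0
```

Deleting by code point leaves the bare consonant (the matra is removed) and would leave half a flag. The behaviour described above is a hybrid: code-point deletion for the matra, whole-sequence deletion for the flag.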


On Sat, Apr 22, 2017 at 12:22 PM, Eli Zaretskii via Unicode
 wrote:
>> Date: Sat, 22 Apr 2017 17:13:36 +0100
>> From: Richard Wordingham via Unicode 
>>
>> > Movement by grapheme
>> > cluster is AFAIK the most natural way of moving in complex scripts.
>>
>> Evidence?
>
> Personal experience?
>
>> It's easiest for displaying the cursor.
>
> It's the _only_ way of displaying the cursor.  You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor.  And
> without being able to position the cursor, a visual feedback to the
> user becomes troublesome at best.
>
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string.  Assuming this would make
> sense to users of some complex scripts, that is.  You are welcome to
> suggest such a feature by using report-emacs-bug.
>
>> SIL's Graphite supports the idea of a split cursor, which
>> shows the glyphs corresponding to the characters before and after the
>> cursor position.
>
> I find split-cursor to be a nuisance, FWIW.  IME, it confuses the
> users without making anything much clearer.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sun, 23 Apr 2017 00:51:59 +0100
> Cc: Julian Bradfield 
> From: Richard Wordingham via Unicode 
> 
> On Sat, 22 Apr 2017 21:39:42 +0100 (BST)
> Julian Bradfield via Unicode  wrote:
> 
> > On 2017-04-22, Eli Zaretskii via Unicode  wrote:
> 
> > > I could imagine Emacs decomposing characters temporarily when only
> > > part of a cluster matches the search string.  Assuming this would
> > > make sense to users of some complex scripts, that is.  You are
> > > welcome to suggest such a feature by using report-emacs-bug.  
> 
> The cursor moves to the cluster boundary, so there is much less of a
> problem with Emacs.

But you wanted to highlight only part of the cluster, AFAIU.

> > That's what I do in my emacs with combining characters, and if I had
> > complex script support, I'd expect the same to happen there.
> > emacs is a programmer's editor, after all :)
> 
> Emacs probably has a way of toggling complex script support somewhere.
> I'm torn between seeing the text properly set out and seeing exactly
> what it is that I've typed.  'Reveal codes' doesn't seem widely
> supported.

"M-x auto-composition-mode RET" should do what you want.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Sat, 22 Apr 2017 21:39:42 +0100 (BST)
Julian Bradfield via Unicode  wrote:

> On 2017-04-22, Eli Zaretskii via Unicode  wrote:

> > I could imagine Emacs decomposing characters temporarily when only
> > part of a cluster matches the search string.  Assuming this would
> > make sense to users of some complex scripts, that is.  You are
> > welcome to suggest such a feature by using report-emacs-bug.  

The cursor moves to the cluster boundary, so there is much less of a
problem with Emacs.

> That's what I do in my emacs with combining characters, and if I had
> complex script support, I'd expect the same to happen there.
> emacs is a programmer's editor, after all :)

Emacs probably has a way of toggling complex script support somewhere.
I'm torn between seeing the text properly set out and seeing exactly
what it is that I've typed.  'Reveal codes' doesn't seem widely
supported.

Richard.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Julian Bradfield via Unicode
On 2017-04-22, Eli Zaretskii via Unicode  wrote:
>> From: Richard Wordingham via Unicode 
[...]
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string.  Assuming this would make
> sense to users of some complex scripts, that is.  You are welcome to
> suggest such a feature by using report-emacs-bug.

That's what I do in my emacs with combining characters, and if I had
complex script support, I'd expect the same to happen there.
emacs is a programmer's editor, after all :)

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sat, 22 Apr 2017 17:13:36 +0100
> From: Richard Wordingham via Unicode 
> 
> > Movement by grapheme
> > cluster is AFAIK the most natural way of moving in complex scripts.
> 
> Evidence?

Personal experience?

> It's easiest for displaying the cursor.

It's the _only_ way of displaying the cursor.  You cannot even
meaningfully move by single characters in most clusters, because
composing characters generally completely changes how the original
characters looked, so there's nowhere you can display the cursor.  And
without being able to position the cursor, a visual feedback to the
user becomes troublesome at best.

> I've encountered the problem that, while at least I can search for
> text smaller than a cluster, there's no indication in the window of
> where in the window the text is.

I could imagine Emacs decomposing characters temporarily when only
part of a cluster matches the search string.  Assuming this would make
sense to users of some complex scripts, that is.  You are welcome to
suggest such a feature by using report-emacs-bug.

> SIL's Graphite supports the idea of a split cursor, which
> shows the glyphs corresponding to the characters before and after the
> cursor position.

I find split-cursor to be a nuisance, FWIW.  IME, it confuses the
users without making anything much clearer.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Sat, 22 Apr 2017 13:34:32 +0300
Eli Zaretskii via Unicode  wrote:

> AFAIR, Emacs allows one to _delete_ individual characters,
> i.e. Backspace and C-d delete character-by-character, so the problem
> shouldn't be so grave for imperfect typists.

Deleting forwards by one _character_ certainly makes life less harsh.
It's pleasanter than the UAX#29 suggestion, "For example, on a given
system the backspace key might delete by code point, while the delete
key may delete an entire cluster".
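
The contrast in that UAX#29 quotation can be sketched in a few lines of
Python.  This is only a rough illustration: the function names are made
up for the example, and "cluster" is approximated here as a base
character plus any following characters of general category M*, which
is much cruder than the real UAX#29 rules.

```python
import unicodedata

def backspace_code_point(text, pos):
    """Delete the single code point before the cursor position."""
    if pos == 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1

def delete_cluster(text, pos):
    """Delete forwards: the character at pos plus any marks following it.

    Approximation only: a 'cluster' here is one character followed by
    characters whose general category starts with 'M'.
    """
    if pos >= len(text):
        return text, pos
    end = pos + 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:pos] + text[end:], pos

# Thai KO KAI + SARA I + MAI EK, followed by KHO KHAI
text = "\u0E01\u0E34\u0E48\u0E02"
print(backspace_code_point(text, 3))  # removes only the tone mark
print(delete_cluster(text, 0))        # removes consonant, vowel and tone mark
```

With deletion by code point the typist loses one mark; with deletion by
cluster the consonant and both marks have to be retyped.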

> Movement by grapheme
> cluster is AFAIK the most natural way of moving in complex scripts.

Evidence?  It's easiest for displaying the cursor.  I've encountered the
problem that, while at least I can search for text smaller than a
cluster, there's no indication in the window of where in the window the
text is.  SIL's Graphite supports the idea of a split cursor, which
shows the glyphs corresponding to the characters before and after the
cursor position.

Richard.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sat, 22 Apr 2017 11:13:16 +0100
> From: Richard Wordingham via Unicode 
> 
> At present these are split into two and three grapheme clusters
> respectively, and LibreOffice cursor movement responds accordingly.
> (SIGN AA starts a grapheme cluster in several scripts of further
> India.)  However, if one teaches the Emacs editor what a Tai Tham
> syllable is, so that it can use the M17n rendering library, the cursor
> then advances syllable by syllable, which is unpleasant for imperfect
> typists.

AFAIR, Emacs allows one to _delete_ individual characters,
i.e. Backspace and C-d delete character-by-character, so the problem
shouldn't be so grave for imperfect typists.  Movement by grapheme
cluster is AFAIK the most natural way of moving in complex scripts.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Fri, 21 Apr 2017 16:27:43 -0700
Manish Goregaokar via Unicode  wrote:

> > Do Hindi speakers really think of orthographic syllables as
> > characters?  
> 
> When rendered as a cluster, yes? I've asked around, and folks seem to
> insist on coupling it to the rendering.

That argues that it's a unit, which I don't think is in dispute.  Words
are also units, and nowadays we don't normally insist that one retype a
word just to change one bit of it.

> Given most fonts render
> *normal* (common, etc) clusters, I think making them EGCs and looking
> at nonrendered clusters the same way we do family emoji is fine
> (family emojis of length 5 are a single EGC, but that's not what's
> actually perceived by the user, but it's a use case that's very rare
> in the wild, so it doesn't matter).

That depends on the language.  In the Tai Tham script, even without
consonant clusters one can get 5 graphic characters in a syllable,
e.g. ᨧᩮᩢ᩶ᩣ _cao_  'lord;
you (polite)', and when one adds consonant clusters one easily gets
monosyllables like ᨠᩖ᩠᩶ᩅ᩠ᨿ _kluai_  'banana' with 5 graphic characters and
additionally 2 coengs.  (One can distinguish Pali from the Tai
languages simply by the density of the ink!)

At present these are split into two and three grapheme clusters
respectively, and LibreOffice cursor movement responds accordingly.
(SIGN AA starts a grapheme cluster in several scripts of further
India.)  However, if one teaches the Emacs editor what a Tai Tham
syllable is, so that it can use the M17n rendering library, the cursor
then advances syllable by syllable, which is unpleasant for imperfect
typists.  Fortunately, it's possible to add functions to Emacs to allow
it to advance character-by-character; I forget if one has to also add a
few code changes.  (The downside is that text either side of the cursor
is rendered independently, which can be a nuisance when editing very
long lines.)

> The way I see it, the current
> system is wrong, and so would the proposed system of not breaking at
> viramas (or not breaking at viramas followed by a consonant if we want
> to be more precise), but the proposed system would be wrong much less
> often.
 
> I am only talking about Devanagari, though scripts like
> Bangla/Gujarati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
> sensible.

Indeed, viramas (InSC=Virama) will have to be handled case-by-case.  One
should continue to break after pulli (U+0BCD TAMIL SIGN VIRAMA) except
for the cases of the ligatures/conjuncts.  I don't know if there are
obscure cases, or whether it's only _shri_ and  for which one
should not break just because of the virama.  Continuation after coengs
(InSC=Invisible_Stacker) should be automatic.

Malayalam will need customisation.  Definitions by codepoints are only
a fallback, for when a font cannot be used to guide the process. 

Formally, normalisation is a problem, as these characters can be
separated from letters by other marks.  This is a problem in practice
for normalised text in Tai Tham.

Pure killers (InSC=Pure_Killer) should probably be given no special
treatment, as at present, by default, though I wonder if we should
define orthographic syllables for Pali in Thai script.  The two
orthographies will need different rules, and renderers won't help.
Defining orthographic syllables for languages in the Latin script is
probably excessive.

Richard.



Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
> Do Hindi speakers really think of orthographic syllables as characters?

When rendered as a cluster, yes? I've asked around, and folks seem to
insist on coupling it to the rendering. Given most fonts render
*normal* (common, etc) clusters, I think making them EGCs and looking
at nonrendered clusters the same way we do family emoji is fine
(family emojis of length 5 are a single EGC, but that's not what's
actually perceived by the user, but it's a use case that's very rare
in the wild, so it doesn't matter). The way I see it, the current
system is wrong, and so would the proposed system of not breaking at
viramas (or not breaking at viramas followed by a consonant if we want
to be more precise), but the proposed system would be wrong much less
often.

I am only talking about Devanagari, though scripts like
Bangla/Gujarati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
sensible.
-Manish


On Fri, Apr 21, 2017 at 4:04 PM, Richard Wordingham via Unicode
 wrote:
> On Thu, 20 Apr 2017 11:17:05 -0700
> Manish Goregaokar via Unicode  wrote:
>
>> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
>>  wrote:
>
>> > Is there consensus on how to count aksharas in the Devanagari
>> > script? The doubts I have relate to a visible halant in
>> > orthographic syllables other than the first.
>
>> I don't think there's consensus.
>
> I've found related discussion at
> https://lists.w3.org/Archives/Public/public-i18n-indic/.  The question
> of how to count was raised and not answered there.
>
>> On Wed, Apr 19, 2017 at 4:35 PM,
>> Richard Wordingham via Unicode  wrote:
>> > Is there consensus on how to count aksharas in the Devanagari
>> > script? The doubts I have relate to a visible halant in
>> > orthographic syllables other than the first.
>
>> I'm of the opinion that Unicode should start considering devanagari
>> (and possibly other indic) consonant clusters as single extended
>> grapheme clusters.
>
> Do Hindi speakers really think of orthographic syllables as characters?
>
> What may be useful is the concept of a definition of an orthographic
> syllable.  It may be possible to get the information from a font -
> depending on the renderer - but a locale-dependent definition should be
> possible for use as a fall-back.  Devanagari rules won't work for
> Tamil, and I think rules for Hindi and Nepali will be slightly
> different -  looks like a problem.
>
> The concept is possibly not useful in some Indic scripts - the concept
> won't work well in Thai, but will work in Pali in the Thai script, for
> both Pali orthographies.
>
> Richard.


Re: Counting Devanagari Aksharas

2017-04-21 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 11:17:05 -0700
Manish Goregaokar via Unicode  wrote:

> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
>  wrote:

> > Is there consensus on how to count aksharas in the Devanagari
> > script? The doubts I have relate to a visible halant in
> > orthographic syllables other than the first.

> I don't think there's consensus.

I've found related discussion at
https://lists.w3.org/Archives/Public/public-i18n-indic/.  The question
of how to count was raised and not answered there.

> On Wed, Apr 19, 2017 at 4:35 PM,
> Richard Wordingham via Unicode  wrote:
> > Is there consensus on how to count aksharas in the Devanagari
> > script? The doubts I have relate to a visible halant in
> > orthographic syllables other than the first.

> I'm of the opinion that Unicode should start considering devanagari
> (and possibly other indic) consonant clusters as single extended
> grapheme clusters.

Do Hindi speakers really think of orthographic syllables as characters?

What may be useful is the concept of a definition of an orthographic
syllable.  It may be possible to get the information from a font -
depending on the renderer - but a locale-dependent definition should be
possible for use as a fall-back.  Devanagari rules won't work for
Tamil, and I think rules for Hindi and Nepali will be slightly
different -  looks like a problem.

The concept is possibly not useful in some Indic scripts - the concept
won't work well in Thai, but will work in Pali in the Thai script, for
both Pali orthographies.

Richard.


Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
That seems like a relatively niche use case (especially with Vedic
Sanskrit) compared to having weird selection for everything else. I'm
not convinced. When I use a romanized Devanagari input method (I
typically do on my laptop), deleting the whole cluster is necessary
anyway for things to work well. Direct input methods do let you edit
in a more granular way but I've never seen the need for that.

I guess this boils down to a matter of opinion and anecdotal
experience, so there's not much I can do to convince this list
otherwise :)

-Manish


On Fri, Apr 21, 2017 at 12:23 AM, Richard Wordingham via Unicode
 wrote:
> On Fri, 21 Apr 2017 00:08:24 -0500
> Anshuman Pandey via Unicode  wrote:
>
>> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode
>> >  wrote:
>
>> > Now imagine you're
>> > typing Vedic Sanskrit, with its clusters and pitch indicators.
>
>> I tried typing Vedic Sanskrit, and it seems to work:
>
>> http://pandey.pythonanywhere.com/devsyll
>
> That should demonstrate nothing relevant if you type correctly first
> time.  The issue comes when you mistype and have to correct, to give
> the usual worst case, the first letter of a conjunct.  Now, I looked at
> your page in Firefox on Ubuntu, and I found the cursor seemed to move
> by extended grapheme cluster.  That means that to change a consonant
> you have to retype the following marks.
>
> I did find two issues with your analyser.
>
> Firstly, it broke श्रीमान्‌को into श्री·मा·न्को, which does not
> concatenate back to the original.
>
> Secondly, you have a problem with ANUDATTA.  You are not accepting
>  as a syllable.  Perhaps you believed
> https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm
> as to the structure of a Devanagari syllable.  I suspect ANUDATTA as a
> consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the
> like came in.
>
> Richard.
>



Re: Counting Devanagari Aksharas

2017-04-21 Thread Richard Wordingham via Unicode
On Fri, 21 Apr 2017 00:08:24 -0500
Anshuman Pandey via Unicode  wrote:

> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode
> >  wrote:

> > Now imagine you're
> > typing Vedic Sanskrit, with its clusters and pitch indicators.  
 
> I tried typing Vedic Sanskrit, and it seems to work:
 
> http://pandey.pythonanywhere.com/devsyll

That should demonstrate nothing relevant if you type correctly first
time.  The issue comes when you mistype and have to correct, to give
the usual worst case, the first letter of a conjunct.  Now, I looked at
your page in Firefox on Ubuntu, and I found the cursor seemed to move
by extended grapheme cluster.  That means that to change a consonant
you have to retype the following marks.

I did find two issues with your analyser.

Firstly, it broke श्रीमान्‌को into श्री·मा·न्को, which does not
concatenate back to the original.
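
The round-trip property in question is easy to check mechanically.  The
sketch below assumes that what went missing from the analyser's output
was the ZWNJ (the helper names are hypothetical, and the segment list
is my reading of the output, not a copy of it).

```python
ZWNJ = "\u200C"

# SHA, VIRAMA, RA, II, MA, AA, NA, VIRAMA, ZWNJ, KA, O
original = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D" + ZWNJ + "\u0915\u094B"

# Segments as assumed to have been reported: the ZWNJ is no longer there.
reported = ["\u0936\u094D\u0930\u0940", "\u092E\u093E", "\u0928\u094D\u0915\u094B"]

def round_trips(segments, text):
    """A segmentation is lossless if its pieces concatenate to the input."""
    return "".join(segments) == text

def first_mismatch(segments, text):
    """Index of the first differing code point, or None if identical."""
    joined = "".join(segments)
    for i, (a, b) in enumerate(zip(joined, text)):
        if a != b:
            return i
    if len(joined) == len(text):
        return None
    return min(len(joined), len(text))

print(round_trips(reported, original))     # False
print(first_mismatch(reported, original))  # 8, the position of the ZWNJ
```

Any segmenter that only inserts boundaries, and never drops or rewrites
characters, passes this check by construction.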

Secondly, you have a problem with ANUDATTA.  You are not accepting
 as a syllable.  Perhaps you believed
https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm
as to the structure of a Devanagari syllable.  I suspect ANUDATTA as a
consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the
like came in. 

Richard.



Re: Counting Devanagari Aksharas

2017-04-20 Thread Anshuman Pandey via Unicode

> On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode 
>  wrote:
> 
> On Thu, 20 Apr 2017 14:14:00 -0700
> Manish Goregaokar via Unicode  wrote:
> 
>> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
>>  wrote:
> 
>>> On Thu, 20 Apr 2017 11:17:05 -0700
>>> Manish Goregaokar via Unicode  wrote:
> 
>>>> I'm of the opinion that Unicode should start considering devanagari
>>>> (and possibly other indic) consonant clusters as single extended
>>>> grapheme clusters.
> 
>>> You won't like it if cursor movement granularity is reduced to one
>>> extended grapheme cluster.  I'm grateful that Emacs allows me to
> 
>> I mean, we do the same for Hangul.
> 
> Hangul is generally a maximum of three characters, which is about the
> border of tolerance. I find it irritating to have to completely retype
> Thai grapheme clusters of consonant, vowel and tone mark.  There were
> loud protests from the Thais when preposed vowels were added to the
> Thai grapheme cluster and implementations then responded, and Unicode
> quickly removed them. Now imagine you're typing Vedic Sanskrit, with its
> clusters and pitch indicators.

I tried typing Vedic Sanskrit, and it seems to work:

http://pandey.pythonanywhere.com/devsyll

Haven't tried the orthographic oddity of the Nepali case in question. Above my 
pay grade.

If you access the above link on an iOS device you'll see tofu and missing 
characters. Apple's Devanagari font needs to be fixed.

- AP



Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 14:14:00 -0700
Manish Goregaokar via Unicode  wrote:

> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
>  wrote:

> > On Thu, 20 Apr 2017 11:17:05 -0700
> > Manish Goregaokar via Unicode  wrote:

> >> I'm of the opinion that Unicode should start considering devanagari
> >> (and possibly other indic) consonant clusters as single extended
> >> grapheme clusters.

> > You won't like it if cursor movement granularity is reduced to one
> > extended grapheme cluster.  I'm grateful that Emacs allows me to

> I mean, we do the same for Hangul.

Hangul is generally a maximum of three characters, which is about the
border of tolerance. I find it irritating to have to completely retype
Thai grapheme clusters of consonant, vowel and tone mark.  There were
loud protests from the Thais when preposed vowels were added to the
Thai grapheme cluster and implementations then responded, and Unicode
quickly removed them. Now imagine you're typing Vedic Sanskrit, with its
clusters and pitch indicators.

> The main time you need intra-conjunct segmentation in Devanagari is
> when deleting something you just typed.

You'll typically be several words beyond by the time you notice, or by
the time a spell-checker spots a problem.

Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
I mean, we do the same for Hangul.

The main time you need intra-conjunct segmentation in Devanagari is
when deleting something you just typed. And backspace usually operates
on code points anyway (except for some weird cases like flag emoji,
though this isn't uniform across platforms). I don't see how
intra-conjunct selection would be useful otherwise.
-Manish


On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
 wrote:
> On Thu, 20 Apr 2017 11:17:05 -0700
> Manish Goregaokar via Unicode  wrote:
>
>> When given a rendered representation people seem to uniformly count
>> conjuncts as multiple aksharas if rendered with visible halant, and as
>> a single akshara if they are rendered conjoined.
>
> Now, that's what I expected.
>
>> I'm of the opinion that Unicode should start considering devanagari
>> (and possibly other indic) consonant clusters as single extended
>> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
>> but sometimes family emoji will not render as a single glyph either
>> (if you use skin tones or more than 4 family members) and we still
>> consider those EGCs.
>
> You won't like it if cursor movement granularity is reduced to one
> extended grapheme cluster.  I'm grateful that Emacs allows me to
> delete and replace the first NFC character of a grapheme cluster.
>
> Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 11:17:05 -0700
Manish Goregaokar via Unicode  wrote:

> When given a rendered representation people seem to uniformly count
> conjuncts as multiple aksharas if rendered with visible halant, and as
> a single akshara if they are rendered conjoined.

Now, that's what I expected.

> I'm of the opinion that Unicode should start considering devanagari
> (and possibly other indic) consonant clusters as single extended
> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
> but sometimes family emoji will not render as a single glyph either
> (if you use skin tones or more than 4 family members) and we still
> consider those EGCs.

You won't like it if cursor movement granularity is reduced to one
extended grapheme cluster.  I'm grateful that Emacs allows me to
delete and replace the first NFC character of a grapheme cluster.
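
To make the point concrete, here is a minimal sketch of the kind of edit
I mean, operating on an already NFC-normalised Python string.  The
function name is invented for the example; it says nothing about how
Emacs does it internally.

```python
def replace_first_character(text, cluster_start, new_char):
    """Swap the code point at cluster_start; everything after it is kept."""
    return text[:cluster_start] + new_char + text[cluster_start + 1:]

# Mistyped SA + VIRAMA + RA + II, where SHA + VIRAMA + RA + II was intended.
mistyped = "\u0938\u094D\u0930\u0940"
fixed = replace_first_character(mistyped, 0, "\u0936")
print(fixed == "\u0936\u094D\u0930\u0940")  # True
```

With editing restricted to whole extended grapheme clusters, the same
correction means deleting and retyping every code point of the cluster,
and the cost grows with the length of the cluster.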

Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 15:33:37 +0530
Shriramana Sharma via Unicode  wrote:
 
> All I can say is that Tamil script has eschewed most consonant cluster
> ligatures/conjoining forms. As for Devanagari, writing श्रीमान्‌को (I
> used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology.
> The latter would be Sanskrit orthography and former perhaps Hindi,
> although I wouldn't know why anyone would want to run in the को with
> the preceding श्रीमान् even in Hindi.

According to p23 of
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, it's
Nepali.  It's a compromise between श्रीमान्को and Hindi-style श्रीमान्
को.

> And IMO it would be better to
> clearly define at the outset what you meant by "akshara" in your
> question to avoid confusions by people replying having a different
> idea of the meaning of that term.

I didn't want to be any more precise than "orthographic syllable".
Swaran Lata is urging, in submission
http://www.unicode.org/L2/L2017/17094-indic-text-seg.pdf to the UTC,
that UAX#29 "Unicode Text Segmentation" adopt a rather naïve definition
of an Indian orthographic syllable.  The worst outcome in my opinion
would be if it were adopted for the extended grapheme cluster
definition - it would make editing orthographic clusters even more
difficult.  However, it would make sense for CLDR to carry localised
definitions.

For layout, the definition would be relevant for 'drop capital effects'
and for the analogue of inserting spaces between letters.  There are
recommendations in a maturing W3C specification for Indic layout,
though to be fair the specification fairly quickly restricts its scope
to Indian scripts.  Now, if the spacing were applied to the Nepali word 
श्रीमान्‌को I would expect to see something like श्री मा न् को, as the
base word itself would appear as श्री मा न्  when subjected to the
same treatment. However, before suggesting minor improvements that might
be in order, I thought I should check whether there was agreement that
 terminated an orthographic syllable.  It now seems that
any general agreement would in fact be that it did *not* terminate an
orthographic syllable!  I must say that stretching श्रीमान्‌को out as
श्री मा न्‌को  feels wrong.  If my feeling is right, then the definition
of orthographic syllable, if it can be done without reference to a
font, belongs in CLDR, as UAX#29 implies, and not in the Unicode
Character Database and Unicode standards.
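
The two candidate splittings can be sketched as follows.  This is a
deliberately naive model, not the definition in L2/17-094 nor anything
a renderer does: a new unit starts at every non-mark character unless
the previous unit ends in a virama, and the flag break_after_zwnj (a
name made up here) decides whether virama + ZWNJ ends the unit.

```python
import unicodedata

VIRAMA = "\u094D"
ZWNJ = "\u200C"

def split_aksharas(text, break_after_zwnj=True):
    """Naive orthographic-syllable splitter for Devanagari (sketch only)."""
    segments = []
    for ch in text:
        # Marks and ZWNJ always stay with the preceding unit.
        attaches = unicodedata.category(ch).startswith("M") or ch == ZWNJ
        if segments and not attaches:
            prev = segments[-1]
            if prev.endswith(VIRAMA):
                attaches = True   # consonant continues a conjunct
            elif prev.endswith(VIRAMA + ZWNJ) and not break_after_zwnj:
                attaches = True   # treat virama + ZWNJ as non-terminating
        if segments and attaches:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments

with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = with_zwnj.replace(ZWNJ, "")

print(len(split_aksharas(with_zwnj)))                          # 4
print(len(split_aksharas(with_zwnj, break_after_zwnj=False)))  # 3
print(len(split_aksharas(without_zwnj)))                       # 3
```

Since the function only ever inserts boundaries, the pieces always
concatenate back to the input.  Which value of the flag is right for
Nepali is exactly the open question, and a locale-specific answer is
the sort of thing that would sit more naturally in CLDR.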

Richard.



Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
I don't think there's consensus.

When given a rendered representation people seem to uniformly count
conjuncts as multiple aksharas if rendered with visible halant, and as
a single akshara if they are rendered conjoined.

Most fonts for devanagari these days are pretty good at conjoining
consonants. They seem to do so for all common conjuncts, and usually
for most practical (i.e. not ridiculously long) conjuncts. I've never
seen a visible halant in text I've read.

I'm of the opinion that Unicode should start considering devanagari
(and possibly other indic) consonant clusters as single extended
grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
but sometimes family emoji will not render as a single glyph either
(if you use skin tones or more than 4 family members) and we still
consider those EGCs.
-Manish


On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
 wrote:
> Is there consensus on how to count aksharas in the Devanagari script?
> The doubts I have relate to a visible halant in orthographic syllables
> other than the first.
>
> For example, according to 'Devanagari VIP Team Issues Report'
> http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
> derived form from Nepali श्रीमान्  should be written श्रीमान्‌को
> <U+0936 DEVANAGARI LETTER SHA, U+094D DEVANAGARI SIGN VIRAMA, U+0930
> DEVANAGARI LETTER RA, U+0940 DEVANAGARI VOWEL SIGN II, U+092E
> DEVANAGARI LETTER MA, U+093E DEVANAGARI VOWEL SIGN AA, U+0928
> DEVANAGARI LETTER NA, U+094D, U+200C ZERO WIDTH NON-JOINER, U+0915
> DEVANAGARI LETTER KA, U+094B DEVANAGARI VOWEL SIGN O> and not
> श्रीमान्को <U+0936, U+094D, U+0930, U+0940, U+092E, U+093E, U+0928,
> U+094D, U+0915, U+094B>.  Now, if the font used has a conjunct for
> SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
> and the latter as having 3 aksharas SH.RII, MAA, N.KO.
>
> If the font leads to the use of a visible halant instead of the vattu
> conjunct SH.RA, as happens when I view this email, would there then be
> 5 and 4 aksharas respectively?  A further complication is that the font
> chosen treats what looks like SH, RA as a conjunct; the vowel I appears
> to the left of SH when added after RA (श्रि).
>
> Richard.
>



Re: Counting Devanagari Aksharas

2017-04-20 Thread Shriramana Sharma via Unicode
Hello Richard. Yes my earlier reply wasn't intended to be offlist. I
have near-zero knowledge about non-Indic languages.

All I can say is that Tamil script has eschewed most consonant cluster
ligatures/conjoining forms. As for Devanagari, writing श्रीमान्‌को (I
used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology.
The latter would be Sanskrit orthography and former perhaps Hindi,
although I wouldn't know why anyone would want to run in the को with
the preceding श्रीमान् even in Hindi. And IMO it would be better to
clearly define at the outset what you meant by "akshara" in your
question to avoid confusions by people replying having a different
idea of the meaning of that term.



-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा



Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
I was offered the following reply:

> To my knowledge except in Tamil script vowel less consonants in
> written form aren't considered as separate "akshara"s in native
> terminology.

Word-finally they seem to be treated as such.  To be more precise, a
final cluster of one or more consonants marked as having no vowel is
so treated; Sanskrit has a few word-final clusters.

> However for text shaping purposes they will surely have
> to be considered as separate orthographic syllables in Unicode
> terminology since in word end position they can sometimes carry svara
> markers.

The complication comes word-internally.  My understanding is that
phonetically syllable-final consonants in non-Indic words in
non-Indic languages have a tendency not to be included in an akshara
along with the start of the next syllable.  However, that tendency is
more evident in scripts other than Devanagari; Devanagari has developed
in the context of Indic languages.

Renderers' syllable-recognition algorithms will naturally treat
word-final devowelled sequences as separate units, rather than
associate them with the previous implicit or explicit vowel.

Burmese is a good example of what can happen with a non-Indic language;
in native words, phonetic syllabic boundaries tend to be orthographic
syllable boundaries.

Text-shaping engines like Microsoft's Uniscribe are more complicated.
For scripts with a virama, they seem to assume that the virama may be
a combining operator, and wait for data from the font to decide how
many clusters to form.

One test is the insertion of white spaces in a word when it is stretched
out.  Of course, that test can only be applied where human decisions
are involved - otherwise we are just looking at what dominant
renderers are actually doing, rather than looking at what they ought
to be doing.

Richard.