Re: Devanagari and Subscript and Superscript

2015-12-16 Thread Doug Ewell
I missed this yesterday.

Plug Gulp wrote:

> General support for all characters, words and sentences could be
> achieved by just three new formatting characters, e.g. SCR, SUP and
> SUB, similar to the way other formatting characters such as ZWS, ZWJ,
> ZWNJ etc are defined. The new formatting characters could be defined
> as:
>
> SCR: In a character stream, all the characters following this
> formatting character shall be treated as [...]
>
> SUP: In a character stream, all the characters following this
> formatting character shall be treated as [...]
>
> SUB: In a character stream, all the characters following this
> formatting character shall be treated as [...]

This isn't similar to ZWSP or ZWJ or ZWNJ. Those formatting characters
are not stateful; they affect the rendering of, at most, the single
characters immediately preceding and following them.

The ones you suggest are stateful; they affect the rendering of
arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
("ANSI") attribute switching, or ISO 2022 character-set switching.
Unicode tries hard to avoid encoding such things.
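
To see concretely what that statefulness costs, here is a minimal sketch
(Python, purely illustrative: SCR/SUP/SUB do not exist, so they are faked
here with Private Use Area code points). To know how any single character
should be displayed, a process has to scan arbitrarily far backwards for the
last marker, which is exactly the kind of out-of-band state that ZWJ and ZWNJ
never require:

    # Hypothetical stateful markers, mapped onto Private Use code points
    # purely for illustration -- Unicode encodes no such characters.
    SCR, SUP, SUB = "\uE000", "\uE001", "\uE002"
    MARKERS = {SCR: "normal", SUP: "superscript", SUB: "subscript"}

    def display_mode(text: str, index: int) -> str:
        """Return the rendering mode of text[index].

        Because the markers are stateful, we must scan backwards over an
        arbitrary amount of preceding text to find the one that applies;
        ZWJ/ZWNJ, by contrast, only affect their immediate neighbours.
        """
        for ch in reversed(text[:index]):
            if ch in MARKERS:
                return MARKERS[ch]
        return "normal"  # default when no marker precedes

    sample = "5" + SUP + "3" + SCR + " = 125"
    assert display_mode(sample, sample.index("3")) == "superscript"
    assert display_mode(sample, sample.index("1")) == "normal"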

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-16 Thread Philippe Verdy
2015-12-16 19:16 GMT+01:00 Doug Ewell :

> The ones you suggest are stateful; they affect the rendering of
> arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
> ("ANSI") attribute switching, or ISO 2022 character-set switching.
> Unicode tries hard to avoid encoding such things.


You can try as hard as you want; there are cases where it is impossible to
avoid stateful encoding if we want to avoid disunifications, and even some
characters that cannot work at all without stateful analysis.

And this is not solved just by style markup when that "style" is in fact
completely semantic. The situation must be considered with more care:

- For example, the superscript Latin letter o, a.k.a. the masculine ordinal
indicator, is not just a superscript but a notation that adds the semantics
of an abbreviation of the final letters, attached to the letters before it,
the whole being semantically a single word. The superscript style does not
create such an attachment (it creates a separate "word" inside it), so the
character was disunified from the letter o.

- But it is not good practice to encode in Unicode things that are just
styles without clear semantics (so encoding SUB/SUP is really a bad idea).

- On the other hand, it is simply impossible to work with Egyptian
hieroglyphs, as the default clusters are clearly insufficient to create ANY
kind of plain text: you need extra markup to add the necessary semantics, not
style, and this markup should be encodable as plain text, without external
markup for the presentation, when that presentation is fully semantic and
clear (e.g. the Egyptian "cartouche" for names of kings).
- A similar issue occurs with SignWriting and other scripts that DO always
require a complex (non-linear) layout, where basic clusters are clearly
insufficient in ALL texts, meaning that the characters that were encoded are
almost **useless** in all plain-text documents: you need extra "format"
characters to express some form of orthographic rule, independently of the
style or of an external markup language.

I'm in favor of adding **semantic** format characters to Unicode, not
stylistic-only format characters, as soon as there exists a well-known
orthographic convention which would work independently of styling. But for
now the encoded format characters only work on too-small clusters, clusters
are only linear, and this is clearly not enough (even for instructing other
kinds of text analysis, such as breakers).

Renderers would then be adapted and extended to work with more complex
clusters that have an internal structure built from simpler cluster parts.
Other renderers using the legacy rules would not be able to do that, but
would attempt to render some basic fallback (possibly with special visible
glyphs for those controls).

One kind of semantic format character which is useful and encoded is the
"invisible parentheses" for mathematics, which can be used for example after
a radical sign: place them around a number to define the extension of the
radical to more than one digit, and make a clear visual and semantic
distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render
any parentheses, or between "sqrt(2+sqrt(3))" and "sqrt(2)+sqrt(3)".


Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Doug Ewell
Plug Gulp wrote:

> It will help if Unicode standard itself intrinsically supports
> generalised subscript/superscript text.

This falls outside the scope of "plain text" as defined by Unicode, in
much the same way as bold and italic styles and colors and font faces
and sizes.

There are several rich-text formats besides HTML that support arbitrary
subscript and superscript text. PDF and Word leap to mind.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-15 Thread srivas sinnathurai
Does the standard support the use of diacritics in plain text format, when used
with all and any complex scripts?

Regards

Sinnathurai

> 
> On 15 December 2015 at 17:46 Doug Ewell  wrote:
> 
> 
> Plug Gulp wrote:
> 
> > It will help if Unicode standard itself intrinsically supports
> > generalised subscript/superscript text.
> 
> This falls outside the scope of "plain text" as defined by Unicode, in
> much the same way as bold and italic styles and colors and font faces
> and sizes.
> 
> There are several rich-text formats besides HTML that support arbitrary
> subscript and superscript text. PDF and Word leap to mind.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 
> 
> 


RE: Devanagari and Subscript and Superscript

2015-12-15 Thread Doug Ewell
srivas sinnathurai wrote:

> Does the standard support the use of diacritics in plain text format,
> when used with all and any complex scripts?

It probably depends on what you mean by "support" and "diacritics." I
can type a Tamil letter followed by a combining acute accent or
diaeresis, and in Arial Unicode MS it actually looks halfway decent.
Many years ago, William Overington famously put a combining circumflex
on top of U+2604 COMET. You just type one character followed by another
and hope for the best, display-wise. You don't get any other special
behavior.
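
For what it's worth, that "type one character and hope for the best" approach
is easy to inspect programmatically. A small sketch, with the Tamil letter
chosen arbitrarily:

    import unicodedata

    # A Tamil letter followed by a combining diaeresis: a perfectly valid
    # code point sequence, whatever a given font makes of it visually.
    seq = "\u0BA4\u0308"  # TAMIL LETTER TA + COMBINING DIAERESIS
    for ch in seq:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

    # No precomposed form exists, so normalization leaves the pair alone.
    assert unicodedata.normalize("NFC", seq) == seq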

I'm not sure if this was supposed to be a comment on my statement that
arbitrary subscript and superscript is similar to other attributes that
are not defined to be part of plain text.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Plug Gulp
On Wed, Dec 9, 2015 at 5:18 AM, Martin J. Dürst  wrote:
>
> I suggest using HTML:
>
> ब<sup>क्ष</sup>
>

This will work only if the end-users are always going to use a web
browser to view the text content.

It will help if the Unicode standard itself intrinsically supports
generalised subscript/superscript text. I think the meaning of the
text should be contained within the text itself rather than relying on
external text markers and viewers. That way the text-content creator
does not have to rely on what type of Unicode-compliant text viewer or
editor the end user is using. The text should retain its meaning
irrespective of the type of Unicode-compliant text viewer or editor
used. Similarly, if the text has to be saved in a database without
losing its meaning, then either it has to be saved with all the known
markers of all the available editors, or some special processing needs
to be incorporated to convert one saved marker to the markers of the
various available text viewers and editors. Having generalised Unicode
support for superscript and subscript will solve all these problems.

Following is one of the use-cases where general Unicode support for
superscript/subscript will help tremendously:

A math teacher(गणिताचे शिक्षक) in a Marathi(मराठी) language school is
writing notes, in her Unicode compliant plain text editor, to explain
mathematical terms to her students. Following is an excerpt from the
notes that explains terms such as exponents(घातांक) and base(पाया).
(English translation is given below):

"जेव्हा एखाद्या संखेचा स्वतःशीच अनेक वेळा गुणाकार होतो तेव्हा त्या
गुणाकाराला थोडक्यात लिहिण्याच्या पद्धतीला घातांक असे म्हणतात.
उदाहरणार्थ, ५ ही संख्या जर स्वतःशी ३ वेळा गुणली जात असेल, म्हणजे ५ x ५
x ५, तर त्याला घातांक पद्धतीत ५^३ असे लिहितात. ह्या घातांकीय रचनेला "५
चा ३ रा घात" असे म्हणतात. आपण अजून एक उदाहरण घेऊया, "२ ना चा १० वा
घात", म्हणजे २ ही संख्या स्वतःशी १० वेळा गुणली गेली आहे. ह्याला आपण
२^१० असे लिहितो. तर साधारणपणे, कूठलीही संख्या ब जेव्हा स्वतःशी क्ष
वेळा गुणलीजाते तेव्हा त्याला घातांक पद्धतीत ब^क्ष असे लिहितात, आणि
त्या रचनेला "ब चा क्ष वा घात" असे म्हणतात. इथे ब ह्या संखेला पाया
म्हणतात आणि क्ष ह्या संखेला घात असे म्हणतात. तर थोडक्यात, घातांकीय
रचनेला पाया^घात असे लिहितात."

English translation:
"Exponent is a shorthand notation that denotes a multiplication of a
number by itself a number of times. For example, if a number 5 is
multiplied by itself 3 times i.e. 5 x 5 x 5, then it is represented in
an exponential form as 5^3. This exponential term is referred to as "5
raise to the power of 3". Let us consider another example, "2 raise to
the power of 10", i.e. 2 is multiplied by itself 10 times. This is
written in exponential form as 2^10. So, in general any number b that
is multiplied by itself k number of times is written as b^k and the
term is referred to as "b raise to the power of k". The number b is
called the base, and the number k is called the exponent. In short,
exponential term is written as base^exponent."

Please note that the teacher had to use a circumflex accent (caret) to
indicate the superscript, which is an unwritten convention, in the absence
of proper superscript support within Unicode. To make the text
available to a wider audience and still retain its meaning, the teacher
will have to rely partly on Unicode support, partly on the markers
available in the various text viewers of her students, partly on the
markers available in the text editors of the peer reviewers of her
text, and partly on the unwritten convention (such as the caret). This
conundrum can be resolved only if there is generalised support for
superscript and subscript within the Unicode standard.
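
It is worth spelling out what the existing support actually covers. Unicode
already encodes superscript and subscript forms of the European digits
(U+00B9, U+00B2, U+00B3, U+2070, U+2074..U+2079 and U+2080..U+2089), so the
purely numeric exponents above can be written in plain text today; what does
not exist is any such form for Devanagari digits or letters, which is exactly
the gap described. A small illustrative sketch:

    # Map ASCII digits to the existing Unicode superscript digits.
    SUPERSCRIPT_DIGITS = {
        "0": "\u2070", "1": "\u00B9", "2": "\u00B2", "3": "\u00B3",
        "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
        "8": "\u2078", "9": "\u2079",
    }

    def caret_to_superscript(text: str) -> str:
        """Rewrite 'base^exponent' using the encoded superscript digits.

        This only works while the exponent is made of ASCII digits; there
        are no superscript forms of Devanagari digits or letters, so the
        Marathi example above cannot be handled this way.
        """
        out, i = [], 0
        while i < len(text):
            if text[i] == "^":
                i += 1
                while i < len(text) and text[i] in SUPERSCRIPT_DIGITS:
                    out.append(SUPERSCRIPT_DIGITS[text[i]])
                    i += 1
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(caret_to_superscript("5^3 = 125, 2^10 = 1024"))  # 5³ = 125, 2¹⁰ = 1024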

The standard already has a section for superscript and subscript.
Generalising and extending this support will help other languages and
scripts. General support for all characters, words and sentences could
be achieved by just three new formatting characters, e.g. SCR, SUP and
SUB, similar to the way other formatting characters such as ZWS, ZWJ,
ZWNJ etc are defined. The new formatting characters could be defined
as:

SCR: In a character stream, all the characters following this
formatting character shall be treated as normal text until either the
end of the character stream or the next SUP or SUB character is
reached. This shall be the default marker i.e. if no marker is
specified then the text shall be treated as normal text until either
the end of the character stream or the next SUP or SUB character is
reached.

SUP: In a character stream, all the characters following this
formatting character shall be treated as superscript text until either
the end of the character stream or the next SCR or SUB character is
reached.

SUB: In a character stream, all the characters following this
formatting character shall be treated as subscript text until either
the end of the character stream or the next SCR or SUP character is
reached.

A general support within Unicode for subscripting and superscripting

Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Richard Wordingham
On Tue, 15 Dec 2015 18:00:16 + (GMT)
srivas sinnathurai  wrote:

> Does the standard support the use of diacritics in plain text format,
> when used with all and any complex scripts?

Relatively few scalar value sequences are prohibited - just possibly
sequences containing unassigned characters that are not
non-characters, but I can't think of any others.  (The
prohibition on unpaired surrogates applies to coded character
sequences, but surrogate characters aren't scalar values.) 

It would appear by Conformance Requirement C5, 'A process shall not
assume that it is required to interpret any particular coded character
sequence', that a process is at liberty to decline to interpret a
sequence of scalar values, even if it has just interpreted it.

I am not aware of any requirements in the standard to interpret
specific character sequences.

In general, the interpretation of character sequences is undefined.
For example, a request for advice on the interpretation of
the combination of U+0331 COMBINING MACRON BELOW and U+0E39 THAI
CHARACTER SARA UU was answered with the instruction to consult the
non-existent typographical tradition.  It's been left to rendering
engine writers to define the interpretation.

Indeed, I am not sure that every sequence of defined scalar values
has an interpretation.  Most pairs of regional indicators don't have an
interpretation, and the interpretation of each variation sequence may
change at least twice: once when the base character becomes defined
(or is defined not to be a possible base character), and again when
the variation sequence is assigned an interpretation as an ill-defined
(or grossly ill-defined) family of glyphs.

Do U+0337 COMBINING SHORT SOLIDUS OVERLAY and U+20E5 COMBINING REVERSE
SOLIDUS OVERLAY have a defined interpretation when their base character
is to be represented by a mirrored glyph?  Note that in general, the
Unicode standard does not define when a character is to be represented
by a mirrored glyph.  This may be defined by a lower-level protocol
(the font file).

Richard.


Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Khaled Hosny
On Tue, Dec 15, 2015 at 11:55:02AM +, Plug Gulp wrote:
> Please note that the teacher had to use a Circumflex Accent (Caret) to
> indicate superscript, which is an unwritten convention, in the absence
> of proper superscript support within Unicode.

If the teacher is explaining actual math to the students, then the
superscript is the least of their worries.

Math typesetting is two-dimensional, and is so much more complex than
regular formatted text (let alone regular plain text) that it needs its own
typesetting engines.

There are various plain-text markup languages for math, if one
really wants to represent complex mathematical notation in plain text.
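
For instance, the teacher's examples in one such plain-text markup language
(LaTeX source, which a math-aware engine then typesets two-dimensionally):

    % Exponents in LaTeX source: the markup itself is plain text.
    $5^{3} = 125$, \quad $2^{10} = 1024$, \quad $b^{k}$ (base $b$, exponent $k$)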


Regards,
Khaled


Re: Devanagari and Subscript and Superscript

2015-12-11 Thread Richard Wordingham
On Wed, 9 Dec 2015 03:24:39 +
Plug Gulp  wrote:

> I am trying to understand if there is a way to use Devanagari
> characters (and grapheme clusters) as subscript and/or superscript in
> unicode text.

Why do you want to do this?  Are you asking about writing Devanagari
vertically rather than horizontally?  If that is what you want, you
should be looking at mark-up such as is found in cascading style sheets
(CSS).  It is an important issue for CJK and Mongolian, and there have
been questions as to what is needed for Indian scripts.  (There's also
an antiquarian interest for historical scripts, such as Phags-pa and
even Egyptian - moves are afoot to support the hieroglyphic script as
plain text.)

Richard.


Re: Devanagari and Subscript and Superscript

2015-12-08 Thread Richard Wordingham
On Wed, 9 Dec 2015 03:24:39 +
Plug Gulp  wrote:

> Hi,
> 
> I am trying to understand if there is a way to use Devanagari
> characters (and grapheme clusters) as subscript and/or superscript in
> unicode text.

The view is that such would not be 'plain text', and therefore need not
be catered for in Unicode.  On the other hand, the desire for
spacing raised and lowered characters is sufficient that markup to
produce them is widely available, as Martin Dürst pointed out.

Non-spacing stacked characters are not common enough for general
support to be available.  In many Indic scripts, stacking is the normal
arrangement, and is supplied via a script-specific special character
that is overloaded with a vowel cancellation symbol.  However,
font-specific deviations from vertical stacking are arranged, and
vowel marks are treated independently.  There is no provision for
vertical stacks to have horizontal offshoots.  (Scripts written
vertically are a different case.)

For characters stacked directly above and below not in the normal
modern fashion of writing words, there can be special characters for
special cases.  For example, there are U+A8EE COMBINING DEVANAGARI
LETTER PA in the Devanagari Extended block and U+0364 COMBINING LATIN
SMALL LETTER E.

Other, clumsier scheme-specific techniques are available in other cases.
See for example the writing of nuclides with an explicit atomic number
in https://en.wikipedia.org/wiki/Nuclide.  The notation needs a mass
number at top left and an atomic number at bottom left.
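
For instance, written out in LaTeX (just to make the two-dimensional
arrangement explicit, since this is precisely what a linear plain-text
encoding cannot express), carbon-14 and uranium-235 look like:

    % Mass number as a leading superscript, atomic number as a leading
    % subscript, stacked on the same side of the element symbol.
    \[ {}^{14}_{\phantom{1}6}\mathrm{C} \qquad {}^{235}_{\phantom{2}92}\mathrm{U} \]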

A fairly general case is the annotation of kanji known as 'ruby'.
Sometimes an application or mark-up scheme will support this directly.

Richard.



Re: Devanagari and Subscript and Superscript

2015-12-08 Thread Martin J. Dürst

Hello Plug,

I suggest using HTML:

ब<sup>क्ष</sup>

Regards,   Martin.

On 2015/12/09 12:24, Plug Gulp wrote:

Hi,

I am trying to understand if there is a way to use Devanagari
characters (and grapheme clusters) as subscript and/or superscript in
Unicode text. It will help if someone could please direct me to any
document that explains how to achieve that. Is there a Unicode marker
that will treat the next grapheme cluster in the Unicode text as
super/subscript? For example, if one wants to represent "ब raise to क्ष",
how does one achieve that; is there a marker to represent it as
follows: ब + SUP + क + ् + ष
where SUP acts as a marker for superscripting the next grapheme
cluster? Similarly for subscripting.

Sorry if this is not the right place to ask this question; in that
case please could you direct me to the right forum?

Thanks and kind regards

~Plug

.



RE: Devanagari Letter Short A

2004-02-19 Thread Aparna A. Kulkarni
The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91.
Neither was it encoded in any of the earlier versions of ISCII. Hence
according to the ISCII standard this character simply cannot be formed.

Aparna A. Kulkarni

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Ernest Cline
Sent: Monday, February 16, 2004 10:59 AM
To: Unicode List
Subject: Devanagari Letter Short A

I've been trying to make sense of the Indian scripts, but am
having one small difficulty.  I can't seem to find the ISCII 1991
equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

Is this a character that is part of the set accessed by the
extended code (xF0) or was this part of the ISCII 1988
standard that did not survive the changes to ISCII 1991?

Alternatively, does ISCII encode this as xA4 + xE0 as this
would seem to generate the proper glyph even tho it
violates the syllable grammar given in Section 8 of ISCII?

Or even more alternatively, am I just missing something
that should be obvious, but which  for some reason I can't see?
Even with the slight differences in the naming conventions
between ISCII and Unicode, I don't seem to be misplacing
any of the other vowels or consonants.

Ernest Cline
[EMAIL PROTECTED]






Re: Devanagari Letter Short A

2004-02-19 Thread Philippe Verdy
From: Aparna A. Kulkarni [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; 'Unicode List' [EMAIL PROTECTED]
Sent: Thursday, February 19, 2004 8:23 AM
Subject: RE: Devanagari Letter Short A


 The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91.
 Neither was it encoded in any of the earlier versions of ISCII. Hence
 according to the ISCII standard this character simply cannot be formed.

 Aparna A. Kulkarni

So could this character exist only for the purpose of supporting languages
that are not covered by ISCII but that share the same Devanagari script, and
is it therefore needed for countries other than India?

(Here I am thinking about Dravidian transcriptions.)

If there's no ISCII standard related to its meaning or encoding, then what is
invalid about coding it as LETTER A followed by the LETTER SHORT E vowel
modifier, possibly with an intermediate INV or other ISCII-compatible
control? How would this break ISCII compatibility?

Aren't there existing practices to represent LETTER SHORT A in ISCII?




Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Philippe Verdy wrote:
 
 U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an
 independant vowel. It can be viewed as a conjunct of the
 independant vowel U+0905 DEVANAGARI LETTER A and the dependant
 vowel sign U+0946 DEVANAGARI VOWEL SIGN SHORT E (noted for
 transcribing Dravidian vowels in the Unicode charts).

You may regard it this way, but that is not so.
U+0905 followed by U+0946 is really U+090E. Compare with the other
scripts to understand why.

 I  don't know why this is not documented, because I can find various
 sources that use U+0904 or U+0905,U+0946 which have exactly the
 same rendering and probably the same meaning and usage.

Wow! You have various sources that use a character added to Unicode
about two and a half years ago! Impressive!

About the rendering of U+0905,U+0946: since it violates the usual
rules, it is up to your system. Mine does not render it properly,
though (unless I cheat).

 I think that U+0946 was added in ISCII 1991 but was absent from ISCII
 1988

No. It was there even in ISCII 83.

 (I think it's too late to define it: ISCII 1988 has been used 
 consistently before,

Hmm... I have really no evidence that ISCII 1988 was used at all...
Would be happy to find one, though...


Antoine




Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Ernest Cline wrote:
 
 I've been trying to make sense of the Indian scripts, but am
 having one small difficulty.  I can't seem to find the ISCII 1991
 equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

I do not believe you'll find it there.
U+0904 had been added to Unicode for version 4.0. In 2001.
URL:http://www.unicode.org/consortium/utc-minutes/UTC-089-200111.html
Search for 89-C19.


 Is this a character that is part of the set accessed by the
 extended code (xF0) or was this part of the ISCII 1988
 standard that did not survive the changes to ISCII 1991?

No and no.

 
 Alternatively, does ISCII encode this as xA4 + xE0 as this
 would seem to generate the proper glyph even tho it
 violates the syllable grammar given in Section 8 of ISCII?

It does not. At the very least, if you want to generate this
character in ISCII this way, try A4 DB E0 (using INV).
This is an ugly hack, of course.

As an aside, in some version of ISCII (EA-ISCII, notably),
A4 E0 is supposed to be equivalent to AD. This is the way
the alphabet is sometimes taught to children in India.

 
Antoine



Re: Devanagari Letter Short A

2004-02-16 Thread Philippe Verdy
My understanding of the Indian scripts coded in Unicode is that the mapping
from ISCII to Unicode is not a straightforward one-to-one mapping, because
ISCII uses a contextual encoding for characters (allowing shifts between
several scripts) and some rich-text features.

The ISCII character model is not exactly the same as the Unicode character
model, even though there was an attempt to make this mapping as simple as
possible by allocating the Unicode code points for each individual
ISCII-supported script in the same relative order, leaving gaps in the
Unicode-encoded scripts for ISCII characters that are not used in one
specific script.

The good reference for how Indian scripts are coded in Unicode is Chapter 9 of
the Unicode 4 reference:
http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf
In summary, with Unicode the model for Devanagari:
- uses consonantal letters with an implied (default) vowel A, modified by the
next coded dependent vowel sign (matra) that creates a graphic conjunct with
the consonant, or
- uses half-forms of consonants to drop the implied vowel in initial
consonants, or
- uses a virama (halant), U+094D, to mark other omissions of the implied vowel
on dead consonant letters (most often on final consonants, but this occurs as
well on initial or medial consonants), by removing the final stem of the full
(live) consonant that normally also depicts a phonetic syllable boundary with
a necessary vowel. The virama thus allows creating conjuncts with other
following dead or live consonants, and normally attaches both consonant
letters into the same syllable or conjunct.
- in some cases, the omission of the implied dependent vowel must not create a
ligated conjunct, so the virama still needs to represent the omission of the
vowel without creating a conjunct that would break the perceived phonetics,
and a ZWNJ is used between the dead consonant (consonant letter + virama) and
the next live consonant.
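
A minimal sketch of the two encodings just described (Python, using the
standard code points; whether a conjunct or a visible virama actually appears
is of course up to the font and the shaping engine):

    import unicodedata

    KA, VIRAMA, SSA, ZWNJ = "\u0915", "\u094D", "\u0937", "\u200C"

    # With a bare virama the renderer is free to form the conjunct (क्ष);
    # with ZWNJ after the virama it should keep the dead consonant visible.
    conjunct_form = KA + VIRAMA + SSA
    explicit_virama_form = KA + VIRAMA + ZWNJ + SSA

    for label, s in [("conjunct", conjunct_form), ("ZWNJ", explicit_virama_form)]:
        names = ", ".join(unicodedata.name(c) for c in s)
        print(f"{label:9}: {s}  [{names}]")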

There's a U+0905 pseudo-consonant /a/ which is used in the absence of a
phonetic consonant, but it follows the same encoding rule as the other
consonant letters /*a/, i.e. coding another isolated vowel requires coding
/a/ before the vowel sign (matra). This encodes approximately the same thing
as the isolated vowels, except that the intended rendering is different.

U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an independent
vowel. It can be viewed as a conjunct of the independent vowel U+0905
DEVANAGARI LETTER A and the dependent vowel sign U+0946 DEVANAGARI VOWEL SIGN
SHORT E (noted for transcribing Dravidian vowels in the Unicode charts). I
don't know why this is not documented, because I can find various sources
that use U+0904 or U+0905,U+0946, which have exactly the same rendering and
probably the same meaning and usage. I think that U+0946 was added in ISCII
1991 but was absent from ISCII 1988 (verify, I don't have the ISCII 1988
reference document), so U+0904 has survived just to allow a mostly one-to-one
mapping with ISCII 1988. But the addition of U+0946

Maybe I'm wrong here, and there are some reasons for this choice. There's no
canonical or compatibility equivalence defined between U+0904 and
U+0905,U+0946 (and I think it's too late to define one: ISCII 1988 has been
used consistently before, and the Unicode stability policy now forbids
defining new equivalences between them).
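
The last point, at least, is easy to check mechanically: no normalization
form maps either spelling to the other, so as far as the standard is
concerned they remain distinct. A small sketch:

    import unicodedata

    short_a = "\u0904"            # DEVANAGARI LETTER SHORT A
    a_plus_sign = "\u0905\u0946"  # LETTER A + VOWEL SIGN SHORT E

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, short_a) == short_a
        assert unicodedata.normalize(form, a_plus_sign) == a_plus_sign

    print("No normalization form unifies the two spellings.")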

- Original Message - 
From: Ernest Cline [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Monday, February 16, 2004 6:28 AM
Subject: Devanagari Letter Short A


 I've been trying to make sense of the Indian scripts, but am
 having one small difficulty.  I can't seem to find the ISCII 1991
 equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

 Is this a character that is part of the set accessed by the
 extended code (xF0) or was this part of the ISCII 1988
 standard that did not survive the changes to ISCII 1991?

 Alternatively, does ISCII encode this as xA4 + xE0 as this
 would seem to generate the proper glyph even tho it
 violates the syllable grammar given in Section 8 of ISCII?

 Or even more alternatively, am I just missing something
 that should be obvious, but which  for some reason I can't see?
 Even with the slight differences in the naming conventions
 between ISCII and Unicode, I don't seem to be misplacing
 any of the other vowels or consonants.

 Ernest Cline
 [EMAIL PROTECTED]




Re: Devanagari Glottal Stop

2003-04-06 Thread Michael Everson
I wrote:

  I would have to disagree with these Indian experts in this instance.
 The Devanagari glottal stop does not have a dot, and indeed, in the
 languages which use it, this character will certainly coexist with
 the question mark. They have different shapes, and different
 functions.
At 15:03 -0800 2003-04-05, Mark Davis wrote:
Can you respond back to them with the information as to the 
languages involved?
I believe they read the Unicore list, don't they, Mark? N2543 and 
02/394 show the character used for the Limbu language, and show the 
glyph without a dot and with a horizontal headbar, which the question 
mark never has. (It also shows an example where, because the 
typesetters didn't have the letter available, they substituted a 
question mark, but that just goes to show that we need to encode 
this, because it is a letter, not a punctuation mark.)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Devanagari Glottal Stop

2003-04-05 Thread Michael Everson
I would have to disagree with these Indian experts in this instance. 
The Devanagari glottal stop does not have a dot, and indeed, in the 
languages which use it, this character will certainly coexist with 
the question mark. They have different shapes, and different 
functions.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Devanagari Glottal Stop

2003-04-05 Thread Mark Davis
Can you respond back to them with the information as to the languages
involved?

Mark
(  )

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, April 05, 2003 01:45
Subject: Re: Devanagari Glottal Stop


 I would have to disagree with these Indian experts in this instance.
 The Devanagari glottal stop does not have a dot, and indeed, in the
 languages which use it, this character will certainly coexist with
 the question mark. They have different shapes, and different
 functions.
 --
 Michael Everson * * Everson Typography *  * http://www.evertype.com






RE: Devanagari

2002-12-03 Thread Alan Wood
Vipul Garg wrote:

 I have downloaded your font chart for Devanagari, which is in the range
 from 0900 to 097F. I have also installed the Arial Unicode font supplied
 by Microsoft office XP suite. I found that not all characters are
 available for Devanagari. For example letters such as Aadha KA, Aadha KHA,
 Aadha GA etc. are not available. 
  
 These letters are required in the devanagari words such as KANYA, NANHA,
 PARMATMA etc.
  
 If you could provide the above letters then our requirement for formation
 of Devanagari words would be possible. This requirement is very crucial as
 we have a large volume project on Devanagari language involving data
 storage in Oracle database.
 
You could try using a different font, for example one of the specialist
Devanagari fonts listed at:

http://www.alanwood.net/unicode/fonts.html#devanagari

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)





RE: Devanagari

2002-12-03 Thread Andy White

Vipul Garg was asking why half characters were not included in the Unicode
code charts and in his copy of the Arial Unicode font.
 
More recent versions of Arial Unicode DO contain half characters etc.
for Devanagari.
As to the code charts: to find the answer, you need to explore the Unicode
web site a bit more.  Please see the following for
detailed information regarding the half characters etc.:
http://www.unicode.org/unicode/standard/where/
http://www.unicode.org/unicode/faq/indic.html
http://www.unicode.org/unicode/uni2book/ch09.pdf

Best Regards
Andy

You Wrote:
I have downloaded your font chart for Devanagari, which is in the range
from 0900 to 097F. I have also installed the Arial Unicode font supplied
by Microsoft office XP suite. I found that not all characters are
available for Devanagari. For example letters such as Aadha KA, Aadha
KHA, Aadha GA etc. are not available. 
 
These letters are required in the devanagari words such as KANYA, NANHA,
PARMATMA etc.





RE: Devanagari

2002-12-03 Thread Marco Cimarosti
Vipul Garg wrote:
 I have downloaded your font chart for Devanagari, which is in 
 the range from 0900 to 097F. I have also installed the Arial 
 Unicode font supplied by Microsoft office XP suite. I found 
 that not all characters are available for Devanagari. For 
 example letters such as Aadha KA, Aadha KHA, Aadha GA etc. 
 are not available. 
  
 These letters are required in the devanagari words such as 
 KANYA, NANHA, PARMATMA etc.
  
 If you could provide the above letters then our requirement 
 for formation of Devanagari words would be possible. This 
 requirement is very crucial as we have a large volume project 
 on Devanagari language involving data storage in Oracle database.
  
 Would appreciate an early reply.

Please see the document "Where is my character?":

http://www.unicode.org/unicode/standard/where/

Also have a look at question 17 in the Indic FAQ:

http://www.unicode.org/unicode/faq/indic.html#17

All is explained in more detail in Section 9.1 Devanagari of the Unicode
manual:

http://www.unicode.org/unicode/uni2book/ch09.pdf

Regards.
M.C.




Re: Devanagari

2002-12-03 Thread John Cowan
[EMAIL PROTECTED] scripsit:

 Au contraire! You might find the attached gif of interest. (This is version
 1.0 of the font. Some people might have earlier versions.)

Ah, excellent.  It has not always been so.

 If you're not getting Indic shaping with Arial Unicode MS, it's very likely
 the fault of your software, not the font (and, of course, not Unicode).

Indeed, but the original poster specified the use of XP (Windows or Office,
I forget which), so I discounted that.

-- 
They do not preach  John Cowan
  that their God will rouse them[EMAIL PROTECTED]
A little before the nuts work loose.http://www.ccil.org/~cowan
They do not teach   http://www.reutershealth.com
  that His Pity allows them --Rudyard Kipling,
to drop their job when they damn-well choose.   The Sons of Martha




RE: Devanagari

2002-12-03 Thread Vipul Garg
I came across the 'Where is my character?' page and read that there is a
combination of keystrokes to represent the Indic half forms, such as KA
and halant combining to form half KA. Also there is a list of other
letter representations through combinations of Devanagari letters.

Please email me the list for my ready reference.

Best Regards,

Vipul Garg

Mind Axis (I) Solutions Pvt. Ltd.
Phone: +91 (22) 55994860 / 61 
-Original Message-
From: John Cowan [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, December 03, 2002 5:33 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Devanagari

Vipul Garg scripsit:

 I have downloaded your font chart for Devanagari, which is in the
range
 from 0900 to 097F. I have also installed the Arial Unicode font
supplied
 by Microsoft office XP suite. I found that not all characters are
 available for Devanagari. For example letters such as Aadha KA, Aadha
 KHA, Aadha GA etc. are not available. 

This is not a Unicode problem.

Arial Unicode is not designed to handle Indic scripts; it does not
contain the necessary ligatures and half forms.  You need to use a
more suitable font.

-- 
John Cowan   [EMAIL PROTECTED]   http://www.ccil.org/~cowan
One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script.  One of the
geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'  --Beverly Erlebacher





RE: Devanagari variations

2002-03-11 Thread James E. Agenbroad

On Fri, 8 Mar 2002, Marco Cimarosti wrote:

 Peter Constable wrote:
  On 03/07/2002 02:16:10 PM James E. Agenbroad wrote:
  
  A similar but not the same situation is found in the fourth 
  example in
  figure 9-3 of Unicode 3.0 (page 214) where an intedpendent 
  vowel has the
  reph (an abridged form of a the consonant 'ra') above it.  Unicode 
  wants
  this encoded as consonant + halant + independent vowel. I 
  believe it is
  better considered as a consonant + vowel sign combination 
  which happens 
  to
  have an odd display and at least one Sanskrit textbook agrees.
  
  I may be wrong, but I believe that example has  ra, halant, ra, 
  independent i . The first ra is the one that  transforms 
  into the reph.
 
 You are wrong, in fact, sorry. Although figure 9-3 does not show code point
 values, both the glyphs and the abbreviated letter names make it clear that
 the sequence is:
 
   U+0930 (DEVANAGARI LETTER RA)
   U+094D (DEVANAGARI SIGN VIRAMA)
   U+090B (DEVANAGARI LETTER VOCALIC R)
 
 James' idea is that the same graphemes could have been better represented
 with sequence:
 
   U+0930 (DEVANAGARI LETTER RA)
   U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R)
 
 It is an interesting idea, because ra never occurs with matra r., so
 there is no danger of confusion. But it is probably too late for changing
 it: it would break compatibility with ISCII and existing Unicode fonts.
 
 _ Marco
 
 
 Monday, March 11, 2002
Ra as reph does occur with r.; cf. Monier Williams' Sanskrit-English
Dictionary, page 554, second column: between niru_ha and nire (using
underscore for macron and  for circumflex) are nirr.i, nirr.ich and
nirr.ij. I believe ISCII is silent on this matter. If so, how can
compatibility with it be broken?  If fonts have this glyph, can't they
allow two encodings to invoke it?  I do not advocate deletion or
deprecation of the encoding shown on page 214 of 3.0 for this glyph; I do
advocate saying somewhere in the Unicode Standard's discussion of Devanagari
that there is another, more plausible and more Indian way to encode this
glyph.
   
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





RE: Devanagari variations

2002-03-08 Thread Marco Cimarosti

Peter Constable wrote:
 On 03/07/2002 02:16:10 PM James E. Agenbroad wrote:
 
 A similar but not the same situation is found in the fourth 
 example in
 figure 9-3 of Unicode 3.0 (page 214) where an intedpendent 
 vowel has the
 reph (an abridged form of a the consonant 'ra') above it.  Unicode 
 wants
 this encoded as consonant + halant + independent vowel. I 
 believe it is
 better considered as a consonant + vowel sign combination 
 which happens 
 to
 have an odd display and at least one Sanskrit textbook agrees.
 
 I may be wrong, but I believe that example has  ra, halant, ra, 
 independent i . The first ra is the one that  transforms 
 into the reph.

You are wrong, in fact, sorry. Although figure 9-3 does not show code point
values, both the glyphs and the abbreviated letter names make it clear that
the sequence is:

U+0930 (DEVANAGARI LETTER RA)
U+094D (DEVANAGARI SIGN VIRAMA)
U+090B (DEVANAGARI LETTER VOCALIC R)

James' idea is that the same graphemes could have been better represented
with sequence:

U+0930 (DEVANAGARI LETTER RA)
U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R)

It is an interesting idea, because ra never occurs with matra r., so
there is no danger of confusion. But it is probably too late for changing
it: it would break compatibility with ISCII and existing Unicode fonts.
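
For reference, the two candidate spellings side by side; a small sketch that
merely spells out the code points (how they render depends, as discussed, on
the shaping engine and font):

    import unicodedata

    figure_9_3_form = "\u0930\u094D\u090B"  # RA + VIRAMA + LETTER VOCALIC R
    matra_form      = "\u0930\u0943"        # RA + VOWEL SIGN VOCALIC R

    for label, s in [("figure 9-3", figure_9_3_form), ("alternative", matra_form)]:
        print(label, "->", " + ".join(unicodedata.name(c) for c in s))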

_ Marco




Re: Devanagari variations

2002-03-08 Thread Michael Everson

At 15:36 -0600 07/03/2002, [EMAIL PROTECTED] wrote:

I may be wrong, but I believe that example has  ra, halant, ra,
independent i . The first ra is the one that  transforms into the reph.

You're wrong. RI in this case is a way of writing the vocalic r. 
Compare Kr.s.n.a and Krishna.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari variations

2002-03-08 Thread Michael Everson

At 15:16 -0500 07/03/2002, James E. Agenbroad wrote:
On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote:

  On 03/06/2002 08:25:18 AM Michael Everson wrote:
  [snip]

  In
  Cham, independent vowels can take dependent vowel signs. In
  Devanagari, I guess that doesn't occur, but the Brahmic model
  shouldn't be understood to preclude this behaviour.
 [snip]
  - Peter

A similar but not the same situation is found in the fourth example in
figure 9-3 of Unicode 3.0 (page 214) where an intedpendent vowel has the
reph (an abridged form of a the consonant 'ra') above it.  Unicode wants
this encoded as consonant + halant + independent vowel. I believe it is
better considered as a consonant + vowel sign combination which happens to
have an odd display and at least one Sanskrit textbook agrees.

Is that the sample you showed me when I was a-photocopying at the 
Library of Congress in August, James? You're saying that RA + virama 
+ INDEPENDENT VOCALIC R and RA + VOWEL SIGN VOCALIC R should both 
produce the same glyph?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari variations

2002-03-08 Thread Michael Everson

Using Apple's WorldText, I can confirm that short I did not reorder 
correctly when preceded by 0294. But the 0294 glyph was in another 
font.

I wonder could we see some samples of this in actual Limbu text?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: Devanagari variations

2002-03-08 Thread Michael Everson

At 11:26 +0100 2002-03-08, Marco Cimarosti wrote:

You are wrong, in fact, sorry. Although figure 9-3 does not show code point
values, both the glyphs and the abbreviated letter names make it clear that
the sequence is:

   U+0930 (DEVANAGARI LETTER RA)
   U+094D (DEVANAGARI SIGN VIRAMA)
   U+090B (DEVANAGARI LETTER VOCALIC R)

James' idea is that the same graphemes could have been better represented
with sequence:

   U+0930 (DEVANAGARI LETTER RA)
   U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R)

It is an interesting idea, because ra never occurs with matra r., so
there is no danger of confusion. But it is probably too late for changing
it: it would break compatibility with ISCII and existing Unicode fonts.

Well, in Apple's WorldText version 1.1 I just typed both of these. 
The first one displayed as RA VIRAMA (visible) VOCALIC R and the 
second displayed as REPHA VOCALIC R. So in at least one 
implementation the latter is supported.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari variations

2002-03-08 Thread Peter_Constable

On 03/08/2002 06:54:54 AM Michael Everson wrote:

Using Apple's WorldText, I can confirm that short I did not reorder
correctly when preceded by 0294. But the 0294 glyph was in another
font.

I wonder could we see some samples of this in actual Limbu text?

It's on its way.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari variations

2002-03-08 Thread Peter_Constable

On 03/08/2002 05:09:46 AM Michael Everson wrote:

At 15:36 -0600 07/03/2002, [EMAIL PROTECTED] wrote:

I may be wrong, but I believe that example has  ra, halant, ra,
independent i . The first ra is the one that  transforms into the reph.

You're wrong. RI in this case is a way of writing the vocalic r.
Compare Kr.s.n.a and Krishna.

I guess that's what I get for commenting on things beyond my ken. Mea culpa.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari variations

2002-03-08 Thread Peter_Constable

Jim Agenbroad responded (off list):

Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a RI
vowel: RA(d) + RI(n) -- RI(n) + RA(sup)   (parens in lieu of subscript)

I didn't realise that RI meant the vocalic R. I mistook it to mean 
something else. I find it a weakness of that section that such notations 
are not defined and prominently displayed in an easy-to-find location.

Thanks for setting me straight. I should have known you knew what you were 
talking about.


Peter




Re: Devanagari variations

2002-03-08 Thread John Cowan

[EMAIL PROTECTED] scripsit:

 I didn't realise that RI meant the vocalic R.  

It reflects the modern Hindi pronunciation of Skt /r=/.


-- 
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_




RE: Devanagari enthousiasm!

2002-03-08 Thread Rick Cameron

It appears that hindi.exe installs Uniscribe - which, AFAIK, is not
permitted by Microsoft - so much for honouring license agreements!

That's another reason why they'd package it as an EXE.

- rick cameron

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Wednesday, 6 March 2002 12:14
To: Yaap Raaf
Cc: [EMAIL PROTECTED]
Subject: Re: Devanagari enthousiasm!



On 06-03-2002 04:29:20 PM Yaap Raaf wrote:

At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote:

I am on a Mac and can't open it,

Well, this is going to be a problem for non-Windows clients, I admit.

it's a
244K .exe  Why an .exe?

I don't know if this is what the BBC was trying to do, but using an
executable installer package is at least one way to make sure people see the
license agreement...

Bob






Re: Devanagari variations

2002-03-08 Thread Michael Everson

At 10:29 -0600 2002-03-08, [EMAIL PROTECTED] wrote:
Jim Agenbroad responded (off list):

 Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a
RI
  vowel: RA(d) + RI(n) -- RI(n) +RA(sup)   ( parens in lieu ofsubscript)

I didn't realise that RI meant the vocalic R. I mistook it to mean
something else. I find it a weakness of that section that such notations
are not defined and prominently displayed in an easy-to-find location.

Actually, I would like to see that written R with dot below. We 
should use decent transliteration in those notations; why not?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari variations

2002-03-08 Thread James E. Agenbroad

On Fri, 8 Mar 2002, Michael Everson wrote:

 At 15:16 -0500 07/03/2002, James E. Agenbroad wrote:
 On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote:
 
   On 03/06/2002 08:25:18 AM Michael Everson wrote:
   [snip]
 
   In
   Cham, independent vowels can take dependent vowel signs. In
   Devanagari, I guess that doesn't occur, but the Brahmic model
   shouldn't be understood to preclude this behaviour.
  [snip]
   - Peter
 
 A similar but not the same situation is found in the fourth example in
 figure 9-3 of Unicode 3.0 (page 214) where an intedpendent vowel has the
 reph (an abridged form of a the consonant 'ra') above it.  Unicode wants
 this encoded as consonant + halant + independent vowel. I believe it is
 better considered as a consonant + vowel sign combination which happens to
 have an odd display and at least one Sanskrit textbook agrees.
 
 Is that the sample you showed me when I was a-photocopying at the 
 Library of Congress in August, James? You're saying that RA + virama 
 + INDEPENDENT VOCALIC R and RA + VOWEL SIGN VOCALIC R should both 
 produce the same glyph?
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
   Friday, March 8, 2002
Michael,
 Yes.  
 [Call lme Jim]
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





Re: Devanagari variations

2002-03-08 Thread James E. Agenbroad

On Fri, 8 Mar 2002 [EMAIL PROTECTED] wrote:

 Jim Agenbroad responded (off list):
 
 Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a RI
 vowel: RA(d) + RI(n) -- RI(n) + RA(sup)   (parens in lieu of subscript)
 
 I didn't realise that RI meant the vocalic R. I mistook it to mean 
 something else. I find it a weakness of that section that such notations 
 are not defined and prominently displayed in an easy-to-find location.
 
 Thanks for setting me straight. I should have known you knew what you were 
 talking about.
 
 
 Peter
 
 
   Friday, March 8, 2002
Peter,
 I agree there is a weakness there.  Maybe more than one. 
 I have mailed you (Peter) the Deshpande and Monier Williams examples
I cited.  
 Have a nice weekend all!
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





RE: Devanagari variations

2002-03-07 Thread Kent Karlsson



 implementations might 
 not recognise a sequence like  consonant, vowel, nukta  as 
 valid. For 
 instance, I understand that if Uniscribe encountered such a 
 sequence, it 
 would  assume you've left out a consonant immediately before 
 the nukta, 
 and it would display a dotted circle to indicate where a missing base 
 character should go.


That behaviour, IMHO, is incorrect.  There is not, and never was,
any kind of grapheme or even combining sequence break
at that point, and there should never be a dotted circle
displayed through that sequence of characters (a show-
individual-characters mode should of course be excepted).

/kent k





Re: Devanagari enthousiasm!

2002-03-07 Thread Peter_Constable

On 03/06/2002 03:12:20 PM Michael Everson wrote:

But a font is not a ISO/IEC 10646 subset! By definition, it contains 
glyph
codes, not character codes. They are in two different worlds.

But in public procurement a subset may be specified, in which case
ASCII will be implied. I don't know who made up this rule, by the way.

I've never seen any font vendor advertise the capabilities of their fonts 
in terms of ISO 10646 subsets. In fact, I've never seen any reference to 
ISO 10646 subsets outside discussions of ISO 10646 such as this.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





RE: Devanagari variations

2002-03-07 Thread Peter_Constable

That behaviour, IMHO, is incorrect.  There is no, and was never
any kind of grapheme or even combining sequence break
at that point, and there should never be a dotted circle
displayed through that sequence of characters (a show-
individual-characters mode should of course be excepted).

I agree. What I'm hoping to find out is whether developers of various 
(whatever) software products can verify whether their code behave 
correctly in this regard. This would likely be speaking to ICU, Java or 
other implementations of Devanagari rendering, or Devanagari fonts (OT or 
otherwise).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari variations

2002-03-07 Thread James E. Agenbroad

On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote:

 On 03/06/2002 08:25:18 AM Michael Everson wrote: 
 [snip] 
 
 In
 Cham, independent vowels can take dependent vowel signs. In
 Devanagari, I guess that doesn't occur, but the Brahmic model
 shouldn't be understood to preclude this behaviour.
[snip]
 - Peter

A similar but not the same situation is found in the fourth example in
figure 9-3 of Unicode 3.0 (page 214), where an independent vowel has the
reph (an abridged form of the consonant 'ra') above it.  Unicode wants
this encoded as consonant + halant + independent vowel. I believe it is
better considered as a consonant + vowel sign combination which happens to
have an odd display, and at least one Sanskrit textbook agrees.  

Didn't Mark Twain say he didn't think much of a person who could spell a
word in only one way?   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





Re: Devanagari variations

2002-03-07 Thread Peter_Constable

On 03/07/2002 02:16:10 PM James E. Agenbroad wrote:

A similar but not the same situation is found in the fourth example in
figure 9-3 of Unicode 3.0 (page 214) where an intedpendent vowel has the
reph (an abridged form of a the consonant 'ra') above it.  Unicode 
wants
this encoded as consonant + halant + independent vowel. I believe it is
better considered as a consonant + vowel sign combination which happens 
to
have an odd display and at least one Sanskrit textbook agrees.

I may be wrong, but I believe that example has  ra, halant, ra, 
independent i . The first ra is the one that  transforms into the reph.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari variations

2002-03-07 Thread Peter_Constable

I have gotten the answer on the question Michael raised about the glottal 
stop: it does *not* have an inherent vowel.

So, given that, I return to the original question: 

quote
The question is whether there is any problem using U+0294, and whether
proposing a Devanagari-specific character would be a better option. One
particular problem I think would be likely to occur is that rendering
engines such as Uniscribe, or whatever is coded into host environments like
Java for Hindi support, would not be able to cope with U+0294 occurring in
the midst of a Devanagari sequence. E.g. I could easily imagine something
like Uniscribe failing to reorder U+093F before a glottal U+0294.

Should we try to educate and convince implementers of the need to allow
U+0294 to be reckoned as part of the Devanagari script, or should we
propose a new Devanagari glottal character?

(I guess on the principle of unifying across languages but not across
scripts, it could be argued that a new character should be proposed.)
/quote
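
To make the reordering concern concrete, here is a much-simplified sketch of
the pre-base reordering a shaping engine performs for DEVANAGARI VOWEL SIGN I
(U+093F, which is displayed to the left of its consonant). The point is that
an engine will only do this if it recognises the preceding character as a
valid Devanagari base, which a glottal U+0294 might not be:

    VOWEL_SIGN_I = "\u093F"  # rendered to the LEFT of its consonant base

    def naive_visual_order(text: str, is_valid_base) -> str:
        """Grossly simplified: swap vowel sign I in front of its base.

        Real shaping engines do far more than this; the sketch only
        illustrates why the base character must be recognised at all.
        """
        chars = list(text)
        for i in range(1, len(chars)):
            if chars[i] == VOWEL_SIGN_I and is_valid_base(chars[i - 1]):
                chars[i - 1], chars[i] = chars[i], chars[i - 1]
        return "".join(chars)

    devanagari_base = lambda c: "\u0900" <= c <= "\u097F" and c != VOWEL_SIGN_I
    # With KA as the base the sign is reordered; with U+0294 it is not.
    print(naive_visual_order("\u0915\u093F", devanagari_base))  # reordered
    print(naive_visual_order("\u0294\u093F", devanagari_base))  # left as is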


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]




- Forwarded by Peter Constable/IntlAdmin/WCT on 03/07/2002 10:10 PM 
-


Jeff  LB Webster [EMAIL PROTECTED]
03/07/2002 08:59 PM

 
To: [EMAIL PROTECTED]
cc: 
Subject:Re: Devanagari variations


Peter,

I responded to Steve also, but the short answer is NO, the glottal has no
inherent vowel.

Jeff






Re: Devanagari enthousiasm!

2002-03-06 Thread Bob_Hallissy



On 06-03-2002 09:59:48 Yaap Raaf wrote:

Win98: You need something called Opentype Devanagri fonts
to VIEW the Hindi unicode text.
You can get a good font for free from BBC Hindi site.

Except that the license that accompanies the font says:

   COPYRIGHT AND ALL OTHER RIGHT, TITLE AND INTEREST
   IN THE SOFTWARE BELONGS TO THE BBC, ITS LICENSORS
   AND SUPPLIERS.  THE BBC IS LICENSED BY ITS LICENSORS
   TO USE AND DISTRIBUTE THE SOFTWARE ON ITS WEBSITE
   WWW.BBC.CO.UK (THE WEBSITE) AND TO PERMIT YOU
   TO DOWNLOAD IT, COPY IT ON TO YOUR PERSONAL COMPUTER
   AND USE IT FOR THE PURPOSES OF ACCESSING, VIEWING
   AND MAKING, FOR YOUR OWN PERSONAL USE ONLY,
   ONE PRINT COPY OF  THE  HINDI LANGUAGE VERSION
   OF THE WEBSITE.  OTHER THAN AS PROVIDED FOR IN
   THIS SECTION, YOU MAY NOT MAKE ANY USE WHATSOEVER
   OF THE SOFTWARE OR THE HINDIFONT.

I interpret this to mean one may not legitimately use this font for any
purpose other than viewing the BBC website.

Bob





Re: fj ligature [Re: Devanagari variations]

2002-03-06 Thread John H. Jenkins


On Wednesday, March 6, 2002, at 03:24 AM, Herman Ranes wrote:

 There is a related problem in connection with Norwegian typography: Most 
 fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a 
 commercial font which includes the 'fj' ligature.


Apple's Hoefler font contains an fj ligature.  If I'm not mistaken, most 
of Adobe's Pro fonts do, too.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: Devanagari variations

2002-03-06 Thread Michael Everson

At 00:12 -0600 2002-06-03, [EMAIL PROTECTED] wrote:

(1) The first problem is the need for a glottal character for Limbu (i.e.,
the Limbu language written in Devanagari script, as opposed to Limbu script,
which already has a symbol for glottal). The Limbu language committee has
decided that this character should be represented using what looks pretty
much like the IPA glottal symbol (U+0294), though in a Devanagari font it
would have to be designed to match Devanagari characters.

I see this in version 2 of the Nepali White Paper 
http://www.cicc.or.jp/english/hyoujyunka/mlit3/7-7-2.pdf

The question is whether there is any problem using U+0294, and whether
proposing a Devanagari-specific character would be a better option. One
particular problem I think would be likely to occur is that rendering
engines such as Uniscribe, or whatever is coded into host environments like
Java for Hindi support, would not be able to cope with U+0294 occurring in
the midst of a Devanagari sequence. E.g. I could easily imagine something
like Uniscribe failing to reorder U+093F before a glottal U+0294.

That almost answers my first question. Does Devanagari glottal have 
an inherent vowel? If it does, encode a new character.

(2) The second problem involves nukta (U+093C). In better-known languages,
nukta can occur only on consonants, but for certain lesser-known
languages, it can occur on vowels as well. Yet some implementations might
not recognise a sequence like consonant, vowel, nukta as valid. For
instance, I understand that if Uniscribe encountered such a sequence, it
would assume you've left out a consonant immediately before the nukta,
and it would display a dotted circle to indicate where a missing base
character should go.

So what would you suggest? A vocalic-nukta? I wouldn't like that. In 
Cham, independent vowels can take dependent vowel signs. In 
Devanagari, I guess that doesn't occur, but the Brahmic model 
shouldn't be understood to preclude this behaviour.

Our people in South Asia have told me the nukta can occur on vowels in the
range U+093e..U+094c, though my contact has told me that he himself has
only seen this on 093E, 0940, 0941 and 094B.

Um, that's AA, II, U, and O. What does the nukta make them sound like?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: fj ligature [Re: Devanagari variations]

2002-03-06 Thread

* Herman Ranes [EMAIL PROTECTED] [2002-03-06 11:24]:
 There is a related problem in connection with Norwegian typography: 
 Most fonts include the 'fi' and 'ffi' ligatures, but I have never 
 heard of a commercial font which includes the 'fj' ligature.

From the Adobe OpenType user guide:
(http://www.adobe.com/type/browser/pdfs/OTGuide.pdf)

# Many Adobe Pro fonts include a large set of standard ligatures, such as
# fi, fl, ffi, ffl, ff, fj, ffj, Th, and others. Most other Adobe OpenType
# fonts have a smaller set of standard ligatures, like those in Type 1
# fonts.

So, you can find some, at least from Adobe, and I guess other quality
font vendors.

-- 
  * [EMAIL PROTECTED]





Re: Devanagari enthousiasm!

2002-03-06 Thread Yaap Raaf

At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote:

I interpret this to mean one may not legitimately use this font for any
purpose other than viewing the BBC website.

If http://www.bbc.co.uk/hindi/images/download_text.gif 
is any indication, the font doesn't look too promising. 
Have you seen it? I am on a Mac and can't open it; it's a
244K .exe. Why an .exe?

BTW, the menus and headers at http://www.bbc.co.uk/hindi/ 
come out as gibberish on my screen, the articles are fine. 
(MacOS 9.0)  

The great enthusiasm in the message I forwarded was what 
struck me, and then, several times, the word 'STANDARD' in capitals!
People have to see results before they believe, and spread the word. 



There was another message announcing Raghu font.

Subject: Free Unicode Hindi fonts
From:Dakshin Shantakumar [EMAIL PROTECTED]
Newsgroups:  alt.language.hindi soc.culture.indian
Date:2 Mar 2002 13:51:45 -0800

Downloadable here
  http://www.ncst.ernet.in/~matra/hindi_display.shtml


It has about 600 glyphs. But no Latin letters, which, IIRC,
disqualifies it as a real Unicode font? 

Some of the glyphs are unfamiliar to me. 




Yaap


-- 


attachment: Raghu_glyphs.jpg


Re: Devanagari enthousiasm!

2002-03-06 Thread Michael Everson

At 17:29 +0100 2002-06-03, Yaap Raaf wrote:

There was another message announcing Raghu font.

Subject: Free Unicode Hindi fonts
From:Dakshin Shantakumar [EMAIL PROTECTED]
Newsgroups:  alt.language.hindi soc.culture.indian
Date:2 Mar 2002 13:51:45 -0800

Downloadable here
   http://www.ncst.ernet.in/~matra/hindi_display.shtml


It has about 600 glyphs. But no Latin letters, which, IIRC,
disqualifies it as a real Unicode font?

Some of the glyphs are unfamiliar to me.

The first one looks to me like RA + VIRAMA + VOCALIC R. That would 
be, in principle, a valid but rare syllable. Maybe it's Vedic.

The second one, I'm not sure what to think of. It looks like NGA + RA 
+ NUKTA + VOCALIC R, and I couldn't begin to wonder what it's 
supposed to represent.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: fj ligature [Re: Devanagari variations]

2002-03-06 Thread Peter_Constable

On 03/06/2002 04:24:54 AM Herman Ranes wrote:

There is a related problem in connection with Norwegian typography:
Most fonts include the 'fi' and 'ffi' ligatures, but I have never
heard of a commercial font which includes the 'fj' ligature.

That's quite a different problem. All it would take to support that in 
Uniscribe/OpenType would be to add the appropriate ligature mapping in a 
GSUB OpenType table; nothing needs to change in Uniscribe. But for cons, 
vow, nukta, Uniscribe will insert U+25CC as a UI device to let the user 
know that something is wrong with the data, but that's based on an 
assumption that nukta can only go on a consonant.
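
A rough sketch of the kind of check being described (illustrative only, not
Uniscribe's implementation): if the nukta is not applied directly to a
consonant, a dotted circle is inserted as a visible "missing base" mark.

  NUKTA = "\u093C"
  DOTTED_CIRCLE = "\u25CC"

  def is_consonant(ch):
      return "\u0915" <= ch <= "\u0939"   # Devanagari consonants KA..HA

  def validate(text):
      out = []
      for ch in text:
          if ch == NUKTA and (not out or not is_consonant(out[-1])):
              out.append(DOTTED_CIRCLE)    # flag the "invalid" position
          out.append(ch)
      return "".join(out)

  print(validate("\u0915\u093C"))          # consonant + nukta: accepted
  print(validate("\u0915\u093E\u093C"))    # consonant + vowel sign + nukta:
                                           # dotted circle inserted before nukta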

To solve your problem, you just need to educate type designers about the 
need. It's relatively simple to solve.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: fj ligature [Re: Devanagari variations]

2002-03-06 Thread John Hudson

At 02:24 3/6/2002, Herman Ranes wrote:

There is a related problem in connection with Norwegian typography: Most 
fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a 
commercial font which includes the 'fj' ligature.

Using such a font, the word 'fire' (four) would be ligated correctly, 
while 'fjerde' (fourth) would not.

And exactly what does the rendering of the 'international' loan-word 
/fjord/ look like in printed matter around the world? I regularly find it 
unligated in English and German reference works which have in other 
aspects virtually perfect typography.

With the increased support of OpenType layout, I think you will see more 
fonts supplying an fj ligature. The Adobe Pro fonts contain this ligature, 
as do any Latin script fonts I have produced for clients over the past 
three years. Most type designers I know are aware of the need for this 
glyph, but until now they have not had a standard and reliable way to make 
it available.

This is not really the same issue as Peter has raised with regard to Limbu, 
where the issue is less the availability of a particular glyph than the 
handling of a character by shaping engines that might fail to identify it 
as part of the surrounding text. The former is a typographic and font 
issue, while the latter is a text processing issue and may be an encoding 
issue.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

... es ist ein unwiederbringliches Bild der Vergangenheit,
das mit jeder Gegenwart zu verschwinden droht, die sich
nicht in ihm gemeint erkannte.

... every image of the past that is not recognized by the
present as one of its own concerns threatens to disappear
irretrievably.
   Walter Benjamin





Re: Devanagari enthousiasm!

2002-03-06 Thread John Hudson

At 08:29 3/6/2002, Yaap Raaf wrote:

There was another message announcing Raghu font.

 Subject: Free Unicode Hindi fonts
 From:Dakshin Shantakumar [EMAIL PROTECTED]
 Newsgroups:  alt.language.hindi soc.culture.indian
 Date:2 Mar 2002 13:51:45 -0800

Downloadable here
   http://www.ncst.ernet.in/~matra/hindi_display.shtml

It has about 600 glyphs. But no Latin letters, which, IIRC,
disqualifies it as a real Unicode font?

No, a Unicode font does not need to contain Latin letters. There are issues 
with using such fonts on Windows 9x and ME, because these systems 
require 8-bit codepage support and there are no MS codepages for Indic 
scripts. This is why MS Mangal, the Hindi UI font that ships with Windows 
2000 and XP, is not licensed for use on older versions of the OS.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

... es ist ein unwiederbringliches Bild der Vergangenheit,
das mit jeder Gegenwart zu verschwinden droht, die sich
nicht in ihm gemeint erkannte.

... every image of the past that is not recognized by the
present as one of its own concerns threatens to disappear
irretrievably.
   Walter Benjamin





Re: Devanagari enthousiasm!

2002-03-06 Thread Michael Everson

At 11:03 -0800 2002-06-03, John Hudson wrote:

It has about 600 glyphs. But no Latin letters, which, IIRC,
disqualifies it as a real Unicode font?

No, a Unicode font does not need to contain Latin letters.

A valid ISO/IEC 10646 subset must contain ASCII.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari enthousiasm!

2002-03-06 Thread Rick McGowan

At 11:03 -0800 2002-06-03, John Hudson wrote:

 No, a Unicode font does not need to contain Latin letters.

And Michael Everson responded:

 A valid ISO/IEC 10646 subset must contain ASCII.

But a font is not an ISO/IEC 10646 subset! By definition, it contains glyph 
codes, not character codes. They are in two different worlds.

So it's still true that a font compatible with Unicode need not contain 
Latin letters.

Rick





Re: Devanagari enthousiasm!

2002-03-06 Thread Bob_Hallissy


On 06-03-2002 04:29:20 PM Yaap Raaf wrote:

At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote:

I am on a Mac and can't open it,

Well, this is going to be a problem for non-Windows clients, I admit.

it's a
244K .exe. Why an .exe?

I don't know if this is what the BBC was trying to do, but using an
executable installer package is at least one way to make sure people see
the license agreement...

Bob






Re: Devanagari enthousiasm!

2002-03-06 Thread John Cowan

Michael Everson scripsit:

 A valid ISO/IEC 10646 subset must contain ASCII.

But a 10646 subset is a coded character set, not a font.

-- 
John Cowan   http://www.ccil.org/~cowan  [EMAIL PROTECTED]
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
--_The Hobbit_




Re: Devanagari enthousiasm!

2002-03-06 Thread Michael Everson

At 12:07 -0800 2002-06-03, Rick McGowan wrote:
At 11:03 -0800 2002-06-03, John Hudson wrote:

  No, a Unicode font does not need to contain Latin letters.

And Michael Everson responded:

  A valid ISO/IEC 10646 subset must contain ASCII.

But a font is not an ISO/IEC 10646 subset! By definition, it contains glyph 
codes, not character codes. They are in two different worlds.

But in public procurement a subset may be specified, in which case 
ASCII will be implied. I don't know who made up this rule, by the way.

So it's still true that a font compatible with Unicode need not contain
Latin letters.

OK. Caveat emptor.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




10646 subsets (was: Re: Devanagari enthousiasm!)

2002-03-06 Thread Kenneth Whistler

Michael Everson said:

 No, a Unicode font does not need to contain Latin letters.
 
 A valid ISO/IEC 10646 subset must contain ASCII.

Besides others pointing out the obvious disconnect
between 10646 subsets and what can be in a valid
Unicode font (which contains glyphs, not characters),
this statement is not correct even in its proper
context. To cite chapter and verse:

10646 defines two kinds of subsets:

Limited subsets (clause 12.1) are simply enumerations of
any list of code points (code positions in 10646-speak).
There are no constraints on this kind of subset, so
it could consist merely of a list of Hebrew combining
marks, for example.

Selected subsets (clause 12.2) consist of lists of
collections from Annex A. It is *selected* subsets
which automatically contain U+0020..U+007E. And
note that it is only *those* code points which
are included, and not ASCII -- which would also
imply inclusion of U+0000..U+001F and U+007F.

--Ken




Re: Devanagari variations

2002-03-06 Thread Peter_Constable

On 03/06/2002 08:25:18 AM Michael Everson wrote:

That almost answers my first question. Does Devanagari glottal have
an inherent vowel? If it does, encode a new character.

That seems like a very good metric to consider, and I hadn't thought of it 
myself. I'd expect that this can be used syllable-initially rather than 
only finally, and so would have an inherent vowel, but I don't know that for 
certain. I've asked my contacts working in S. Asia for further info.


(2) The second problem involves nukta (U+093C). In better-known 
languages,
nukta can occur only on consonants, but for certain lesser-known
languages, it can occur on vowels as well. Yet some implementations 
might
not recognise a sequence like  consonant, vowel, nukta  as valid. For
instance, I understand that if Uniscribe encountered such a sequence, it
would  assume you've left out a consonant immediately before the nukta,
and it would display a dotted circle to indicate where a missing base
character should go.

So what would you suggest? A vocalic-nukta? I wouldn't like that. 
 
No, I wouldn't suggest anything different. The question is mainly intended 
to find out to what extent implementers are making assumptions that would 
present problems.


In
Cham, independent vowels can take dependent vowel signs. In
Devanagari, I guess that doesn't occur, but the Brahmic model
shouldn't be understood to preclude this behaviour.

There's a general problem: writing systems of lesser-known languages 
sometimes involve behaviours that don't occur in the writing systems of 
better-known languages, but software implementations get designed based 
upon what is known, meaning the better-known writing systems only, and 
sometimes implementations incorporate constraints based upon what is 
exemplified in those better-known writing systems. E.g. there are 
Mon-Khmer languages spoken in Thailand that get written with Thai script 
but have many more vowel distinctions than Thai and so need to use 
combinations of combining marks not used in combination for Standard Thai, 
yet some important software implementations incorporate sequence 
constraints that treat these combinations as error conditions.



Um, that's AA, II, U, and O. What does the nukta make them sound like?

I haven't any idea, myself. 


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari keyboard for WindowsXP

2002-02-24 Thread Peter_Constable

On 02/23/2002 08:58:28 PM yaapraaf wrote:

Now I have one more question, related with the next:

 # Subject: Re: How to create Unicode input methods for MacOS? (long)

 # Our 'uchr' resources are created using an assembler. It's the
 # only tool we are aware of that can fill in the offsets the 'uchr'
 # data structure contains. We don't have a custom 'uchr' editing
 # tool at this point (we would love to have one...).

If such a comparison could be made, does the Keyman wizard
represent the kind of 'intelligence' of the assembler, or is there
a fundamental difference between Windows and Mac systems that
makes keyboard editing on Windows less difficult than it is for the
Mac in the case of 'uchr'? What's the difference?

I don't know these things in detail; when I need to know the details I go 
to my co-worker Jonathan Kew. I have some ideas, though.

As I understand it, there are some significant differences. First, my 
understanding is that the Mac uchr resources (and their predecessor, kchr 
resources) are tables compiled in a binary format that map from scan codes 
to characters, usually in a 1:1 manner, though I know kchr could support 
dead keys -- i.e. a one-level subtable, so that if a key designated to be 
a deadkey was pressed, then the following keystroke used a different set 
of lookups. If I recall, kchr resources were part of a script bundle, 
meaning that each kchr resource had a script code, and there could be at 
most one kchr resource per script code. I seem to recall that there were 
32 possible script codes, but I'd really need to check the docs to make 
sure. (You can find this stuff on the Apple site if you know where to 
dig.) I don't know what happens in these regards with uchr resources.

My understanding of these kinds of low-level details is also imperfect, 
but a little better than for the Mac. (When I really need to know details 
in this area that I'm rusty on, I ask Marc Durdin, the author of Keyman.) 
From what I understand, Windows is somewhat different, and adding Keyman 
into the mix makes it even more different. Windows uses individual files -- 
.kbd on Win9x/Me and .dll on NT/2K/XP -- that contain mapping tables to 
map scan codes into virtual characters and character codes. There can be 
any number of these on a system, but they have to be associated with a 
LANGID to use them. Win32 allows lots of LANGIDs -- way more than the number 
of script codes allowed in QuickDraw. Win32 also allows multiple kbds/dlls 
to be associated with a given LANGID, except that there's something in 
Win9x/Me that keeps this from working (I don't know if it's just a UI 
problem or something at a lower level). I don't know exactly what kinds of 
mappings can be created in a kbd/dll, but it wouldn't surprise me to learn 
that it's pretty much comparable to what could be done in a KCHR resource.

Now, Keyman is really an entirely different mechanism. It doesn't create a 
distinct kbd/dll, but makes use of one on the system. It will intercept 
what is generated by the system, and use its own mechanisms to map to 
character codes. And its mechanisms are *far* richer than 1:1 mappings or 
1:1 mappings plus deadkeys. This is so because it was first created for 
use with Win3.x to create presentation form-encoded data involving scripts 
of SE Asia. In other words, it needed to handle some of the kinds of 
transformations that are handled today by Uniscribe and OpenType.

You can think of Keyman's wizard as similar to an assembler, except that 
what it generates is not processor assembly code but rather rules in the 
Keyman keyboard description language. For instance, if I use the Keyman 
wizard and drag the shape for a Devanagari KA (U+0915) onto the K key in 
the screen representation of a keyboard, then it will generate the rule

+ k > U+0915

in the KMN text file that constitutes the programming code for the input 
method being created. After you have used the wizard, you can revise the 
KMN program in whatever way you want, just as if you hadn't used the 
wizard but were writing the behaviour by hand -- though you wouldn't need 
to do this if your input method only requires simple, context-free 
keystroke-to-character mappings.
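
To make that concrete, here is a rough Python sketch (not Keyman, not a real
'kchr'/'uchr' resource -- just an illustration, with a made-up layout) of a
context-free key-to-character table plus a single dead key:

  BASE = {"k": "\u0915", "K": "\u0916"}              # k -> KA, K -> KHA
  DEADKEYS = {";": {"a": "\u093E", "i": "\u093F"}}   # ';' then a/i -> vowel sign

  def type_keys(keys):
      out, pending = [], None
      for key in keys:
          if pending is not None:
              out.append(DEADKEYS[pending].get(key, key))
              pending = None
          elif key in DEADKEYS:
              pending = key            # dead key: wait for the next keystroke
          else:
              out.append(BASE.get(key, key))
      return "".join(out)

  print(type_keys("k;i"))              # -> U+0915 U+093F (KA + vowel sign I)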



It may be clear I don't know an assembler from a wizard, but I'm
just amazed about the 'problem' at Apple and the (seemingly)
simple solution for Windows in this regard.

Several years ago, Jonathan Kew created the SILKey program, which is 
basically Keyman for the Mac. (There was a misunderstanding about some 
details in the Keyman documentation that resulted in some slight 
differences between Keyman's description language and that used by SILKey, 
but eventually both programs had been revved so that a single description 
could be written to be used on both platforms.) SILKey is available from 
the SIL web site. But, it has *not* been updated to support Unicode. There 
are some issues that would be involved in doing so, and since there aren't 
too many Unicode apps 

Re: Devanagari keyboard for WindowsXP

2002-02-23 Thread Peter_Constable

On 02/22/2002 05:26:53 PM Yaap Raaf wrote:

I've also been looking at Tavultesoft's Keyman, but there are no
readymade keyboards available for the purpose. I don't know
how complicated it is to develop one.

For simple behaviours, it can be quite easy; e.g. if you just need to 
assign Devanagari characters to keys on a US keyboard without any 
additional behaviour considerations, there's a wizard with a visual UI 
that you can use. If you need, though, you can create fairly sophisticated 
input methods. How difficult it is depends on how complex the required 
behaviour is, but the learning curve is **far** less than learning to 
program in C or to use the Windows DDK.

And I've never heard of anyone complaining that a Keyman input method 
could not keep up with a user's typing speed.

BTW, Keyman 6 (in development) will support Microsoft's Text Services 
Framework, and also (I think) rules that are sensitive to both preceding 
and following context. That should make it possible to create fairly 
polished text-editing user interfaces even for Indic scripts or Arabic, 
where there are complex character/presentation relationships involved. 
(E.g. I'd guess that Arabic users would find it less distracting if 
characters appeared initially in non-final forms and only change to final 
forms after a word-breaking or non-joining character has been entered. If 
that's true, I can envision implementing that without too much difficulty 
using Keyman.)



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Devanagari keyboard for WindowsXP

2002-02-23 Thread yaapraaf

At 15:08 +0100 2002.02.23, [EMAIL PROTECTED] wrote:

On 02/22/2002 05:26:53 PM Yaap Raaf wrote:

I've also been looking at Tavultesoft's Keyman, but there are no
readymade keyboards available for the purpose. I don't know
how complicated it is to develop one.

For simple behaviours, it can be quite easy; e.g. if you just need to 
assign Devanagari characters to keys on a US keyboard without any 
additional behaviour considerations, there's a wizard with a visual UI 
that you can use. If you need, though, you can create fairly sophisticated 
input methods. How difficult it is depends on how complex the required 
behaviour is, but the learning curve is **far** less than learning to 
program in C or to use the Windows DDK.

Thanks Peter, I had hoped for your response as you are the 
Keyman specialist here.  A few others responding off list were of 
the same opinion. So, no aksharmala, but Keyman. 

Now I have one more question, related with the next:

 # X-UML-Sequence: 13313 (2000-04-17 20:30:54 GMT)
 # X-Nutritional-Content-Warning: Message body contains 53% quoted lines
 # From: Deborah Goldsmith [EMAIL PROTECTED]
 # To: Unicode List [EMAIL PROTECTED]
 # Cc: Marco Piovanelli [EMAIL PROTECTED]
 # Date: Mon, 17 Apr 2000 12:30:53 -0800 (GMT-0800)
 # Subject: Re: How to create Unicode input methods for MacOS? (long)
 # [.]
 # [.]
 # Our 'uchr' resources are created using an assembler. It's the 
 # only tool we are aware of that can fill in the offsets the 'uchr' 
 # data structure contains. We don't have a custom 'uchr' editing 
 # tool at this point (we would love to have one...).

If such a comparison could be made, does the Keyman wizard 
represent the kind of 'intelligence' of the assembler, or is there 
a fundamental difference between Windows and Mac systems that 
makes keyboard editing on Windows less difficult than it is for the 
Mac in the case of 'uchr'? What's the difference? 

It may be clear I don't know an assembler from a wizard, but I'm
just amazed about the 'problem' at Apple and the (seemingly) 
simple solution for Windows in this regard. 



Yaap

-- 






RE: Devanagari

2002-01-22 Thread Marco Cimarosti

David Starner wrote:
 On Mon, Jan 21, 2002 at 02:20:17PM +0100, Marco Cimarosti wrote:
  What this means in practice for website developers is:
  
  1) SCSU text can only be edited with a text editor which 
 properly decodes
  the *whole* file on load and re-encodes it on save. On the 
 other hand, UTF-8
  text can also be edited using an encoding-unaware editor, 
 although non-ASCII
  text is invisible.
 
 True for users of Latin-based writing systems. Probably of little
 comfort to users of Indic or Chinese-based writing systems.

I was referring to the task of editing *source* files in HTML, XML, or other
computer languages and formats. Most of the time, programmers and webmasters
are interested in changing the ASCII part of the file (mark-up,
instructions), which is the part that most likely contains bugs to be
fixed, or that needs changes unrelated to the linguistic content.

Of course, the people in charge of writing the *content* need tools that
can display the actual characters. And this is true for users of Latin-based
writing systems as well: imagine writing in French or German with all
occurrences of é, è, ä, ö, ü, etc. transformed into pairs of funny bytes.

 Better to stick with editors that are aware of your encoding.

Of course. Provided that one exists on your platform, and that you are not
bound to development tools which don't support it.

  2) SCSU text cannot be built by assembling binary pieces coming from
  external sources.
 
 It's not really designed for that. If you're assembling things, just
 run the output through a UTF-8 to SCSU converter.

Which translates to: SCSU is not appropriate for dynamic HTML pages, or for
encoding text inside any other kind of application.

More generally, SCSU is not appropriate as a text encoding, but just as a
compression method for documents in their final form.

Ciao.
_ Marco




Re: Devanagari on MacOS 9.2 and IE 5.1

2002-01-22 Thread Yung-Fong Tang



It should be fine also on Netscape 6.2 

[EMAIL PROTECTED] wrote:

  I spoke too fast. Upon taking a closer look at the file, the font was not
  set properly. MacOS 9.2, Indian Language Kit, Mac IE 5.1 and Devanagari MT
  as font face seem to display UTF-8 encoded Hindi just fine.

  Etienne


Re: Devanagari

2002-01-21 Thread Michael Everson

At 23:19 -0600 2002-01-20, David Starner wrote:

There is no simple encoding scheme that will encode Indic text in
Unicode in one byte per character.

Raw 32-bit encoding treats all characters equally, doesn't it? :-)
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari

2002-01-21 Thread Michael Everson

At 00:39 -0500 2002-01-21, Aman Chawla wrote:

The issue was originally brought up to gather opinion from members of this
list as to whether UTF-8 or ISCII should be used for creating Devanagari web
pages. The point is not to criticise Unicode but to gather opinions of
informed persons (list members) and determine what is the best 
encoding for information interchange in South-Asian scripts...

If you want only local users who have ISCII fonts to read them, use 
ISCII. I wouldn't be able to read such pages, though, because I don't 
have any ISCII support under Mac OS X. I *do* have Unicode-based 
Devanagari support, though.

The best encoding for information interchange is to use a SINGLE 
encoding, namely Unicode. For the Web, use UTF-8.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari

2002-01-21 Thread Lars Marius Garshol


* [EMAIL PROTECTED]
| 
| This is why I really wish that SCSU were considered a truly
| standard encoding scheme.  Even among the Unicode cognoscenti it
| is usually accompanied by disclaimers about private agreement only
| and not suitable for use on the Internet, where the former claim
| is only true because of the self-perpetuating obscurity of SCSU and
| the latter seems completely unjustified.

Do you know of any published web pages that use SCSU? I think that's
probably the place to start. I never add support for encodings I can't
find in actual use on the web. (Hint hint. :)

Note that IANA *does* consider SCSU a real encoding scheme, since
they've defined a tag for it. (Thanks to Markus Scherer, I know, but
it does help.)

--Lars M.





Re: Devanagari

2002-01-21 Thread Michael Everson

Aman,

What is it you want? To complain about the architecture of Unicode 
and UTF-8? For good or ill, it isn't going to change. Neither was it 
a conspiracy to suppress the non-English-speaking peoples of the 
world.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari

2002-01-21 Thread James Kass


Aman Chawla wrote,

 
 With regards to South Asia, where the most widely used modems are approx. 14
 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly
 unheard of, efficiency in data transmission is of paramount importance...
 how can we convince the south asian user to create websites in an encoding
 that would make his client's 14 kbps modem as effective (rather,
 ineffective) as a 4.6 kbps modem?
 

This is a very good question.  How, indeed?

There are pros and cons for practically any situation and it seems that
you are asking these questions in order to help evaluate those pros and 
cons.

A while back, Benefits of Unicode was a very interesting thread on
this list and the results of those discussions formed the basis for some
web pages on the subject.  Tex Texin made the original page about the
benefits, which is on-line at: 
 http://www.geocities.com/i18nguy/UnicodeBenefits.html
The page offers links to other pages resulting from the same thread,
including a page by Suzanne Topping listing some of the disadvantages 
of Unicode.

One way to encourage members of your user community to embrace
the standard might be to offer a translation of some of that material 
in Unicode Hindi, with the respective authors' permissions, of course, 
and post it on the web.

With newer operating systems being Unicode-based, there are no special
plug-ins or filters involved.  A sophisticated and elegant writing system
like Devanagari justifies having a clever rendering system for display.
Unicode and OpenType support for such scripts is expected to be
built in to operating systems.  As far as I can tell, under the current
OpenType model, OpenType support for Indic scripts under ISCII
encoding isn't possible because the features required not only for
plain text Devanagari, but also for typographically advanced Devanagari,
are registered to the various Indic Unicode ranges.  At least as far
as the Microsoft OS goes, a feature like the half-letter form won't be applied
to anything encoded in the ASCII or ISO-8859-1 range.  (Once again,
as far as I can tell.)

Encourage people to look towards the future.  Studying the trend over
the past several years as more groups and systems moved towards the
Unicode Standard might foster the belief that anyone converting their
existing files from a localized encoding into Unicode would be converting
those files for the last time.  On the other hand, converting from a 
localized encoding into ISCII would possibly mean that the material would 
eventually need to be converted into Unicode anyway.

I agree with you that efficient and effective data transmission is
extremely important and suggest that the most effective way to
exchange data is in a standard fashion which is supported worldwide.

Best regards,

James Kass.







Re: Devanagari

2002-01-21 Thread Guntupalli Karunakar

On Sun, 20 Jan 2002 23:57:29 -0500
Aman Chawla [EMAIL PROTECTED] wrote:


 With regards to South Asia, where the most widely used modems are approx. 14
 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly
 unheard of, efficiency in data transmission is of paramount importance...
 how can we convince the south asian user to create websites in an encoding
 that would make his client's 14 kbps modem as effective (rather,
 ineffective) as a 4.6 kbps modem?
 
 Leave aside that most don't even have modems or, for that matter, PCs; they
use cybercafes, with 8-10 users sharing a 33Kbps connection.
 And AFAIK (except for the sites containing Sanskrit manuscripts) no
Indian-language website uses even ISCII; text there is in some ad hoc font
encoding. So pages can't be indexed by a search engine, and you can't save a
page and read/print it later.
Tools created for making Indian-language webpages (actually most people just
use FrontPage) don't even save pages in ISCII, let alone export to ISCII.
Unicode solves a lot of these issues, but existing vendors of Indian-language
software would find it inconvenient, as they can't lock in users to their products.
 A lot more could be said, but I would go off-topic (I think I am already ;-).

Regards,
Karunakar




RE: Devanagari

2002-01-21 Thread Marco Cimarosti

Doug Ewell wrote:
 Devanagari text encoded in SCSU occupies exactly 1 byte per
 character, plus an additional byte near the start of the
 file to set the current window (0x14 = SC4).

The problem is what happens if that very byte gets corrupted for any
reason...

If an octet is erroneously deleted, changed or added from an UTF-8 stream,
only a single character would be corrupted. If the same thing happens to the
window-setting byte of a SCSU (or other similar zany formats), the whole
stream turns into garbage.

What this means in practice for website developers is:

1) SCSU text can only be edited with a text editor which properly decodes
the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
text can also be edited using an encoding-unaware editor, although non-ASCII
text is invisible.

2) SCSU text cannot be built by assembling binary pieces coming from
external sources. E.g., you cannot get a SCSU-encoded template file and fill
in the blanks with customer data coming from a SCSU-encoded database: each
time you insert a piece of text coming from the database, you delete the
current window information, turning into garbage the rest of the file. On
the other hand, UTF-8 allows this, provided that the integrity of each
multi-byte sequence is maintained.

3) A SCSU page can only be accepted by browsers and e-mail readers that are
able to decode it. On the other hand, UTF-8 also works on old ASCII-based
browsers, although non-ASCII text is clearly not properly displayed.

_ Marco




SCSU (was: Re: Devanagari)

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 1:33:23 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Do you know of any published web pages that use SCSU? I think that's
 probably the place to start. I never add support for encodings I can't
 find in actual use on the web. (Hint hint. :)

This becomes a vicious circle, as it is just as reasonable to say that I 
never create Web pages in encodings that existing browsers can't support.

I'm not sure what is the best way to break this circle, except that when I do 
finally set up a Web site (\u263a) I might include a parallel SCSU version 
along with the UTF-8 version, along with a brief description of SCSU.

-Doug Ewell
 Fullerton, California




SCSU (was: Re: Devanagari)

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 5:20:55 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Doug Ewell wrote:
 Devanagari text encoded in SCSU occupies exactly 1 byte per
 character, plus an additional byte near the start of the
 file to set the current window (0x14 = SC4).

 The problem is what happens if that very byte gets corrupted for any
 reason...

 If an octet is erroneously deleted, changed or added from an UTF-8 stream,
 only a single character would be corrupted. If the same thing happens to the
 window-setting byte of a SCSU (or other similar zany formats), the whole
 stream turns into garbage.

Yes, SCSU is stateful and the corruption of a single tag, or argument to a 
tag, could potentially damage large amounts of text.  I know this was a big 
problem in the days of devices and transmission protocols that did little or 
no error correction.  I honestly don't know how big a problem it is today.

 What this means in practice for website developers is:

 1) SCSU text can only be edited with a text editor which properly decodes
 the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
 text can also be edited using an encoding-unaware editor, although non-ASCII
 text is invisible.

I have edited SCSU text using a completely encoding-ignorant MS-DOS editor.  
Of course I couldn't edit the SCSU control bytes intelligently, but then I 
can't edit multibyte UTF-8 sequences intelligently with it either.

 2) SCSU text cannot be built by assembling binary pieces coming from
 external sources. E.g., you cannot get a SCSU-encoded template file and fill
 in the blanks with customer data coming from a SCSU-encoded database: each
 time you insert a piece of text coming from the database, you delete the
 current window information, turning into garbage the rest of the file.

The current window information is not deleted, it is carried over into any 
adjoining text that does not redefine it.  (This could have its own 
repercussions, of course.)

 3) A SCSU page can only be accepted by browsers and e-mail readers that are
 able to decode it. On the other hand, UTF-8 also works on old ASCII-based
 browsers, although non-ASCII text is clearly not properly displayed.

Same as 1).  If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you 
have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences 
that readers must know how to decode.  SCSU does use states, like any 
compression scheme, so an encoding-ignorant tool will probably have more 
trouble with SCSU than with UTF-8.  But I was not arguing to foist SCSU on an 
unprepared world, I was suggesting that the world should prepare.  \u263a
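
A minimal sketch of the byte counts being discussed, assuming the default
SCSU dynamic windows from UTS #6 (window 4 is preset to U+0900 and is
selected by the tag byte SC4 = 0x14); it only handles printable ASCII plus
the Devanagari block, which is enough to show the one-byte-per-character
behaviour quoted above:

  def scsu_devanagari(text):
      out = bytearray([0x14])                    # SC4: make window 4 active
      for ch in text:
          cp = ord(ch)
          if 0x20 <= cp <= 0x7E:
              out.append(cp)                     # ASCII passes through
          elif 0x0900 <= cp <= 0x097F:
              out.append(cp - 0x0900 + 0x80)     # one byte per Devanagari char
          else:
              raise ValueError("outside this toy encoder's repertoire")
      return bytes(out)

  hindi = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"   # 'Unicode' in Devanagari
  print(len(scsu_devanagari(hindi)))             # 8 bytes (1 tag + 7 chars)
  print(len(hindi.encode("utf-8")))              # 21 bytes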

-Doug Ewell
 Fullerton, California




RE: Devanagari

2002-01-21 Thread Christopher J Fynn

Aman

Here in Bhutan the Internet connection is still much worse than in most
places I've visited in India & Nepal (and the cost per minute is several
times higher) - believe me even then UTF-8 (or UTF-16) encoded pages do not
display noticeably slower than ASCII, ISCII or 8-bit font encoded pages -
and I don't need to download any special plug-ins or fonts.

- Chris

--
Christopher J Fynn
Thimphu, Bhutan

[EMAIL PROTECTED]
[EMAIL PROTECTED]


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Aman Chawla
 Sent: 21 January 2002 10:57
 To: James Kass; Unicode
 Subject: Re: Devanagari


 - Original Message -
 From: James Kass [EMAIL PROTECTED]
 To: Aman Chawla [EMAIL PROTECTED]; Unicode
 [EMAIL PROTECTED]
 Sent: Monday, January 21, 2002 12:46 AM
 Subject: Re: Devanagari


  25% may not be 300%, but it isn't insignificant.  As you note, if the
  mark-up were removed from both of those files, the percentage of
  increase would be slightly higher.  But, as connection speeds continue
  to improve, these differences are becoming almost minuscule.

 With regards to South Asia, where the most widely used modems are
 approx. 14
 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly
 unheard of, efficiency in data transmission is of paramount importance...
 how can we convince the south asian user to create websites in an encoding
 that would make his client's 14 kbps modem as effective (rather,
 ineffective) as a 4.6 kbps modem?






RE: Devanagari

2002-01-21 Thread [EMAIL PROTECTED]


On this subject, Win2K and IE5+ seem to do a nice job displaying UTF8-encoded Hindi. 
On the Mac, the Indian Language Kit provides for OS support and fonts (with MacOS 9.2 
and above), but I have not been able to display Hindi (UTF8 encoded) with Mac's IE 
5.1. Am I correct in assuming that the Mac version of IE does not support Hindi 
without a hack?

Etienne

Reply-To: [EMAIL PROTECTED]
Christopher J Fynn [EMAIL PROTECTED] [EMAIL PROTECTED]
Cc: Aman Chawla [EMAIL PROTECTED]
RE: Devanagari
Date: Mon, 21 Jan 2002 23:59:38 +0600

Aman

Here in Bhutan the Internet connection is still much worse than in most
places I've visited in India & Nepal (and the cost per minute is several
times higher) - believe me even then UTF-8 (or UTF-16) encoded pages do not
display noticeably slower than ASCII, ISCII or 8-bit font encoded pages -
and I don't need to download any special plug-ins or fonts.

- Chris

--
Christopher J Fynn
Thimphu, Bhutan

[EMAIL PROTECTED]
[EMAIL PROTECTED]


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Aman Chawla
 Sent: 21 January 2002 10:57
 To: James Kass; Unicode
 Subject: Re: Devanagari


 - Original Message -
 From: James Kass [EMAIL PROTECTED]
 To: Aman Chawla [EMAIL PROTECTED]; Unicode
 [EMAIL PROTECTED]
 Sent: Monday, January 21, 2002 12:46 AM
 Subject: Re: Devanagari


  25% may not be 300%, but it isn't insignificant.  As you note, if the
  mark-up were removed from both of those files, the percentage of
  increase would be slightly higher.  But, as connection speeds continue
  to improve, these differences are becoming almost minuscule.

 With regards to South Asia, where the most widely used modems are
 approx. 14
 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly
 unheard of, efficiency in data transmission is of paramount importance...
 how can we convince the south asian user to create websites in an encoding
 that would make his client's 14 kbps modem as effective (rather,
 ineffective) as a 4.6 kbps modem?








Re: Devanagari

2002-01-20 Thread James Kass


Aman Chawla wrote,



 I would be grateful if I could get opinions on the following:

 1. Which encoding/character set is most suitable for using Hindi/Marathi
 (both of which use Devanagari) on the internet as well as in databases, and
 why? In your response, please refer to:
 http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html,
 particularly the following paragraphs:
snip

Unicode is the best.  It is the World's standard for computer encoding, and,
as such, offers the best possibility that text can be exchanged around the
globe and cross-platform.

The arguments about relative size are true, but in this day and age are
considered unimportant.  Graphics files are extremely large in comparison
with text files of any script, and so are sound files.  Devanagari in UTF-8 is
three bytes per character.  The four-byte UTF-8 sequences are so far only used
for Plane One characters and up.
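
For instance, in Python:

  print(len("\u0915".encode("utf-8")))      # DEVANAGARI LETTER KA  -> 3 bytes
  print(len("A".encode("utf-8")))           # ASCII letter          -> 1 byte
  print(len("\U00010400".encode("utf-8")))  # a Plane One character -> 4 bytes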

 3. With reference to the previous question, can programs that convert
 the myriad Devangari encodings in use today to a standard encoding
 (question 1) be made freely available, and how?

Yes, converters exist and are being distributed.  Just go to the Google
search engine and input "character conversion Unicode" into the box.
Look for ICU and Rosette, to name a few.  You might even run across
Mark Leisher's download page at:
  http://crl.nmsu.edu/~mleisher/download.html
and see the PERL script for converting the Naidunia Devanagari encoding
to UTF-16.

 4. Is there any search engine on the internet that maintains an up to date
 index of sites in Devanagari? If not, what can be done to encourage
 proprietary search engines to support Hindi? Google supposedly has a
 Hindi language option, but surprise, it's in Roman script! Several emails
 to them have elicited the response: At the moment we don't support
 Devanagari...

This appears to be because Google is converting UTF-8 strings input
to the search words box into decimal NCRs.

When I pasted यूनिकोड क्या है into the Google box, it displayed fine.
Since the What is Unicode? pages are popular and have been up for a while,
I thought they would have a good chance of being indexed.  But there were
no hits for the resulting search string:
&#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337;
&#2325;&#2381;&#2351;&#2366; &#2361;&#2376;
...which is not surprising since the actual page doesn't use NCRs.
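
The decimal NCR form above is easy to reproduce; a small illustrative
Python snippet:

  def to_ncrs(text):
      return "".join(ch if ord(ch) < 0x80 else "&#%d;" % ord(ch) for ch in text)

  print(to_ncrs("\u092F\u0942\u0928\u093F\u0915\u094B\u0921"))
  # -> &#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337;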

Best regards,

James Kass.








Re: Devanagari Rupee Symbol

2002-01-20 Thread Michael Everson

At 11:22 -0500 2002-01-20, Aman Chawla wrote:
I am unable to find the Devanagari Rupee sign encoded in Unicode? Is 
it encoded? If not, why?


U+20A8.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Devanagari

2002-01-20 Thread Asmus Freytag

At 12:48 AM 1/20/02 -0800, James Kass wrote:
The arguments about relative size are true, but in this day and age are
considered unimportant.  Graphics files are extremely large in comparison
with text files of any script and so are sound files.  Devanagari UTF-8 is
three bytes.  The four byte UTF-8 sequences so far are only used for
Plane One Unicode and up.

If the argument refers to 4-byte sequences for Devanagari, it is not
factually 'true', as James points out.

More to the point is the following observation: HTML or similar mark-up
languages account for an ever growing percentage of transmission of
text - even in e-mail.

The fact that UTF-8 economizes on the storage for ASCII characters, is a
benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
claims a significant fraction of the data.

A UTF-8 encoded HTML file will therefore have (percentage-wise) less overhead
for Devanagari than claimed. Add to that James' observation on graphics files,
many of which accompany even the simplest HTML documents, and you get a
percentage difference between the sizes of an English and a Devanagari website
(i.e. in its entirety) that's well within the fluctuation of the typical
length, in characters, for expressing the same concept in different languages.

In other words, contrary to the claims made by the argument, it is hard to
predict that this structure of UTF-8 will have an observable impact on
exchanging data - other than a psychological one, perhaps.

In many size constrained application areas it may pay off to do compression.
http://www.unicode.org/unicode/reports/tr6 shows how one can compress
Unicode Data in Devanagari to a size comparable to that of 8-bit ISCII.
However, interchange of this format (SCSU) requires consenting parties.

A./




Re: Devanagari

2002-01-20 Thread Aman Chawla

 The fact that UTF-8 economizes on the storage for ASCII characters, is a
 benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and
 claims a significant fraction of the data.

 A UTF-8 encoded HTML file, will therefore have (percentage-wise) less
overhead
 for Devanagari as claimed. Add to that James' observation on graphics
files,
 many of which accompany even the simplest HTML documents and you get a
 percentage difference between the sizes of an English and Devanagari
website
 (i.e. in its entirety) that's well within the fluctuation of the typical
 length in characters, for expressing the same concept in different
languages.

The point was that a UTF-8 encoded HTML file for an English web page
carrying say 10 gifs would have a file size one-third that for a Devanagari
web page with the same no. of gifs - even if you take into account the
fluctuation of the typical length in characters, for expressing the same
concept in different languages. This is because in some cases one language
may express a concept more compactly while in other cases it may not, and on
the whole this effect would balance out and can therefore be neglected.
Therefore transmission of a Devanagari web page over a network would take
thrice as long as that of an English web page using the same images and
presenting the same information.






Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 16:49:17 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 The point was that a UTF-8 encoded HTML file for an English web page
 carrying say 10 gifs would have a file size one-third that for a Devanagari
 web page with the same no. of gifs...
 Therefore transmission of a Devanagari web page over a network would take
 thrice as long as that of an English web page using the same images and
 presenting the same information.

This conclusion ignores two obvious points, which Asmus already made:

(1) The 10 GIFs, each of which may well be larger than the HTML file, take 
the same amount of space regardless of the encoding of the HTML file.  The 
total number of bytes involved in transmitting a Web page includes 
everything, HTML and graphics, but the purported factor of 3 applies only 
to the HTML.

(2) The markup in an HTML file, which comprises a significant portion of the 
file, is all ASCII.  So the factor of 3 doesn't even apply to the entire 
HTML file, only the plain-text content portion.

In addition, text written in Devanagari includes plenty of instances of 
U+0020 SPACE, plus CR and/or LF, each of which occupies one byte 
regardless of the encoding.

I think before worrying about the performance and storage effect on Web pages 
due to UTF-8, it might help to do some profiling and see what the actual 
impact is.
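
One rough way to do that profiling (a sketch; the file name is hypothetical):
measure how much of a UTF-8 HTML file is single-byte ASCII (markup, spaces,
line breaks) versus multi-byte text.

  def profile(path):
      data = open(path, "rb").read()
      ascii_bytes = sum(1 for b in data if b < 0x80)
      multi_bytes = len(data) - ascii_bytes
      print("total:", len(data), "bytes;",
            "ASCII:", ascii_bytes, "multi-byte:", multi_bytes,
            "(%.0f%% of the file)" % (100.0 * multi_bytes / len(data)))

  # e.g. profile("WhatIsUnicode-hindi.html")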

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread Christopher Vance

On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote:
: The point was that a UTF-8 encoded HTML file for an English web page
: carrying say 10 gifs would have a file size one-third that for a Devanagari
: web page with the same no. of gifs - even if you take into account the
: fluctuation of the typical length in characters, for expressing the same
: concept in different languages. This is because in some cases one language
: may express a concept more compactly while in other cases it may not, and on
: the whole this effect would balance out and can therefore be neglected.
: Therefore transmission of a Devanagari web page over a network would take
: thrice as long as that of an English web page using the same images and
: presenting the same information.

And the whole UTF-8 Devanagari page is probably still smaller than
even one of the .gif files.

-- 
Christopher Vance




Re: Devanagari

2002-01-20 Thread David Starner

On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote:
 The point was that a UTF-8 encoded HTML file for an English web page
 carrying say 10 gifs would have a file size one-third that for a Devanagari
 web page with the same no. of gifs 

The point is that the text for a short webpage is 10k for English and
30k for Devanagari, the HTML will be another 10k for English and another
10k for Devanagari, and the graphics will be another 30k for English and
another 30k for Devanagari, meaning that the total will be 50k for
English and 70k for Devanagari - a 40% markup, not 200%. Adding a 150k
graphic would make it 200k for English and 220k for Devanagari, making it
a 10% markup. 

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




Re: Devanagari

2002-01-20 Thread James Kass


Doug Ewell wrote,

 
 I think before worrying about the performance and storage effect on Web pages 
 due to UTF-8, it might help to do some profiling and see what the actual 
 impact is.
 

The What is Unicode? pages offer a quick study.

14808 bytes (English)
15218 bytes (Hindi)
10808 bytes (Danish)
11281 bytes (French)
 9682 bytes (Chinese Trad.)

(The English page includes links to all the other scripts, but the individual
script pages only link back to the English page.  So, the English page is a
bit larger than the other pages for this reason, not a fair test if we only
count the English and Hindi pages.)

The Unicode logo gif at the top left corner of each of these pages takes
 bytes.  A screen shot of the beginning of the Hindi page takes
37569 bytes as a gif, the small portion cropped and attached takes
4939 bytes.

The What is Unicode? pages are at:
http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Best regards,

James Kass.




hindiwhatis.gif
Description: GIF image


Re: Devanagari

2002-01-20 Thread Barry Caplan

At 10:44 PM 1/20/2002 -0500, you wrote:
Taking the extra links into account the sizes are:
English: 10.4 Kb
Devanagari: 15.0 Kb
Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
of documents/manuscripts (in plain text) in Devanagari, this factor could be
as high as approx. 3 using UTF-8 and around 1 using ISCII.


Yes, but that is this page only. Are you suggesting that all pages will 
vary by that factor? Of course not.

Please consider whether the space *in practice* is a limiting factor. It 
seems that folks on the list feel it is not. Not for bandwidth limited 
applications, and not for disk space limited applications.

The amount of space devoted to plain text of any language on a typical web 
page is microscopic compared to the markup, images, sounds, and other files 
also associated with the web page.

Are you suggesting that UTF-8 ought to have been optimized for Devanagari text?

Barry Caplan
www.i18n.com -- coming soon...






Re: Devanagari

2002-01-20 Thread Aman Chawla

- Original Message -
From: James Kass [EMAIL PROTECTED]
To: Aman Chawla [EMAIL PROTECTED]; Unicode
[EMAIL PROTECTED]
Sent: Monday, January 21, 2002 12:46 AM
Subject: Re: Devanagari


 25% may not be 300%, but it isn't insignificant.  As you note, if the
 mark-up were removed from both of those files, the percentage of
 increase would be slightly higher.  But, as connection speeds continue
 to improve, these differences are becoming almost minuscule.

With regards to South Asia, where the most widely used modems are approx. 14
kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly
unheard of, efficiency in data transmission is of paramount importance...
how can we convince the South Asian user to create websites in an encoding
that would make his client's 14 kbps modem as effective (rather,
ineffective) as a 4.6 kbps modem?





Re: Devanagari

2002-01-20 Thread David Starner

On Sun, Jan 20, 2002 at 10:44:00PM -0500, Aman Chawla wrote:
 For sites providing archives
 of documents/manuscripts (in plain text) in Devanagari, this factor could be
 as high as approx. 3 using UTF-8 and around 1 using ISCII.

Uncompressed, yes. It shouldn't be nearly as bad compressed - gzip, zip,
bzip2, or whatever your favorite tool is. You could also use UTF-16 or
SCSU, which will get it down to about 2 or about 1, respectively.
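
(To see the numbers, a minimal Python sketch; the Hindi sample string is made
up for illustration, and the exact byte counts will vary with the text:)

    import gzip

    # Illustrative sample only: a short Hindi phrase, repeated to mimic a document.
    sample = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921 \u0915\u094d\u092f\u093e \u0939\u0948? " * 200

    utf8 = sample.encode("utf-8")
    utf16 = sample.encode("utf-16-le")   # 2 bytes per BMP character, no BOM
    print(len(utf8), len(utf16), len(gzip.compress(utf8)))
    # UTF-8 uses 3 bytes per Devanagari character, UTF-16 uses 2, and a
    # general-purpose compressor does far better than either on this
    # repetitive sample.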

What's your point in continuing this? Most of the people on this list
already know how UTF-8 can expand the size of non-English text. There's
nothing we can do about it. Even if you had brought it up when UTF-8
was being designed, there's not much anyone could have done about it.
There is no simple encoding scheme that will encode Indic text in
Unicode in one byte per character. 

It's the pigeonhole principle in action - if you need to encode 150,000
characters, you can't encode each one in one or two bytes, and while you
can write encodings that approach that for normal text, they aren't
going to be simple or pretty.
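
(The pigeonhole arithmetic, spelled out in a couple of lines of Python:)

    import math
    # A fixed-width code for 150,000 distinct characters needs at least
    # ceil(log2(150000)) = 18 bits per character, so no uniform one- or
    # two-byte encoding can cover them all; anything tighter has to be
    # variable-length or stateful.
    print(math.ceil(math.log2(150_000)))   # 18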

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




Re: Devanagari

2002-01-20 Thread Aman Chawla



- Original Message -
From: "David Starner" [EMAIL PROTECTED]
To: "Aman Chawla" [EMAIL PROTECTED]
Cc: "James Kass" [EMAIL PROTECTED]; "Unicode" [EMAIL PROTECTED]
Sent: Monday, January 21, 2002 12:19 AM
Subject: Re: Devanagari

 What's your point in continuing this? Most of the people on this list
 already know how UTF-8 can expand the size of non-English text.

The issue was originally brought up to gather opinion from members of this
list as to whether UTF-8 or ISCII should be used for creating Devanagari web
pages. The point is not to criticise Unicode but to gather opinions of
informed persons (list members) and determine what is the best encoding for
information interchange in South-Asian scripts...


Re: Devanagari

2002-01-20 Thread Geoffrey Waigh

On Sun, 20 Jan 2002, Aman Chawla wrote:

 Taking the extra links into account the sizes are:
 English: 10.4 Kb
 Devanagari: 15.0 Kb
 Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives
 of documents/manuscripts (in plain text) in Devanagari, this factor could be
 as high as approx. 3 using UTF-8 and around 1 using ISCII.

Well a trivial adjustment is to use UTF-16 to store your documents if you
know they are going to be predominantly Devanagari.  Or if you have so much
text that the number of extra disks is going to be painful, use SCSU to
bring it very close to the ISCII ratio.  Of course I would note that you
can store millions of pages of plain text on a single hard disk these
days.  If you're going to be storing so many hundreds of millions of pages of
plain text that the number of extra disks is a bother, I am amazed that
none of it might be outside the ISCII repertoire.  And this huge document
archive has no graphics component to go with it...

But the real reason for publishing the data in Unicode on the web is so
people not using a machine specially configured for ISCII will still be
able to read and process the data.

[then later wrote:]

 With regards to South Asia, where the most widely used modems are
 approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where
 broadband/DSL is mostly unheard of, efficiency in data transmission is
 of paramount importance... how can we convince the south asian user to
 create websites in an encoding that would make his client's 14 kbps
 modem as effective (rather, ineffective) as a 4.6 kbps modem?

Can you read 500 characters per second?  So long as they are receiving
only plain text, even this dawdling speed is not going to impact them.
People wanting to efficiently transfer data will use a compression
program.
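
(A back-of-the-envelope check of those figures, assuming roughly 10 bits on
the wire per byte and 3 bytes per Devanagari character in UTF-8:)

    # Rough throughput for uncompressed UTF-8 Devanagari over a slow modem.
    # Assumes ~10 bits per byte on the wire (start/stop bits), 3 bytes/char.
    for bps in (14_400, 36_000, 56_000):
        print(bps, "bps ->", round(bps / 10 / 3), "Devanagari chars/sec")
    # 14.4 kbps works out to roughly 480 characters per second.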

Geoffrey








Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 20:49:00 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Usually, when someone offers
 a large body of plain text in any script, files are compressed 
 in one way or another in order to speed up downloads.

This is why I really wish that SCSU were considered a truly standard 
encoding scheme.  Even among the Unicode cognoscenti it is usually 
accompanied by disclaimers about private agreement only and not suitable 
for use on the Internet, where the former claim is only true because of the 
self-perpetuating obscurity of SCSU and the latter seems completely 
unjustified.

Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus 
an additional byte near the start of the file to set the current window (0x14 
= SC4).
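
(A minimal sketch of that, assuming text confined to the Devanagari block
plus printable ASCII; real SCSU has more machinery, and the function name
below is made up for illustration:)

    def scsu_devanagari(text):
        # SC4 (0x14) selects dynamic window 4, whose default offset is U+0900,
        # the Devanagari block; after that each Devanagari character is a
        # single byte in 0x80..0xFF, and printable ASCII passes through as-is.
        out = bytearray([0x14])
        for ch in text:
            cp = ord(ch)
            if 0x0900 <= cp <= 0x097F:
                out.append(0x80 + cp - 0x0900)
            elif 0x20 <= cp <= 0x7E:
                out.append(cp)
            else:
                raise ValueError("sketch handles only Devanagari plus printable ASCII")
        return bytes(out)

    sample = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"   # the word "devanagari"
    print(len(sample), "chars ->", len(scsu_devanagari(sample)), "SCSU bytes")   # 8 -> 9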

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 21:49:02 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 The issue was originally brought up to gather opinion from members of this
 list as to whether UTF-8 or ISCII should be used for creating Devanagari web
 pages. The point is not to criticise Unicode but to gather opinions of
 informed persons (list members) and determine what is the best encoding for 
 information interchange in South-Asian scripts...

It seems that the only point against Unicode compared to ISCII is the 
resulting document size in bytes, and this one point is being given 100% 
focus in the comparison.

If the actual question is, "What is the most efficient encoding for 
Devanagari text, in terms of bytes, using only the most commonly encountered 
encoding schemes and no external compression?" then of course you will have 
loaded the question in favor of ISCII.

But when you consider that more browsers today around the world (not just in 
India) are equipped to handle Unicode than ISCII, and that Unicode allows not 
only the encoding of ASCII and Devanagari but the full complement of Indic 
scripts (Oriya, Gujarati, Tamil...) as well as any other script on the planet 
that you could realistically want to encode, you will probably have to 
rethink the cost/benefit tradeoff of Unicode.

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread David Starner

On Mon, Jan 21, 2002 at 12:57:39AM -0500, [EMAIL PROTECTED] wrote:
 This is why I really wish that SCSU were considered a truly standard 
 encoding scheme.  Even among the Unicode cognoscenti it is usually 
 accompanied by disclaimers about private agreement only and not suitable 
 for use on the Internet, where the former claim is only true because of the 
 self-perpetuating obscurity of SCSU and the latter seems completely 
 unjustified.

Does Mozilla support it? If someone's willing to spend a little time,
adding it to Mozilla is one way to make it more generally usable. And
maybe then IE will get nudged into playing a little catchup . . .

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




RE: Devanagari question

2000-11-15 Thread Ayers, Mike


 From: Rick McGowan [mailto:[EMAIL PROTECTED]]

 Mike Ayers wrote:

  The last I knew,
  computer-savvy Taiwan and Hong Kong were continuing to invent new
  characters.  In the end, the onus is on the computer to
 support the user.

 Yes, the computer should support the user, but... The
 invention of new characters to serve multitudes is OK, and
 international standards will probably continue to support
 that.  But I don't think it's reasonable or appropriate to
 keep inventing new characters willy-nilly for individuals (as
 reported), and then expect them to be added to an
 international standard.  That's silly.  The onus is not on
 international standards to support the whimsical production
 of novel, rarely-used, or nonce characters of the type
 reported to be generated.

That is not established.  The degree to which computer or user will
dictate what will and will not be permitted has yet to be decided.
Certainly, I already have full support for any words that I care to make up
- I need merely spell them.  Since hanzi are words-as-characters, the issue
is much more cloudy, since the position of the Unicode specification (due to
the encoding method used) is that hanzi are characters-only.  This may not
be the final solution.

 In any case, I still have never seen actual documentary
 evidence that would prove to me that in fact Taiwan and Hong
 Kong *ARE* creating new characters at the drop of a hat.
 People just keep saying that to scare everyone.  Sounds like
 an urban myth to me.

Good point.  I will go seek a definitive answer.  Not much point in
discussing this if it doesn't really happen.


/|/|ike



Re: Devanagari question

2000-11-14 Thread Antoine Leca

Mark Davis wrote:
 
 The Unicode Standard does define the rendering of such combinations, which
 is in the absence of any other information to stack outwards.
 
 A dumb implementation would simply move
 the accent outwards if there was already one in the same position. This will not
 necessarily produce an optimal positioning, but should be readable.

Note that it also should increase the line spacing.
Note also that the renderer should notice that event, even when there
are interleaved irrelevant (zero-width) characters.
And we are using a dumb implementation.

Anyway, my point was not about this, which are as you say, the basics of
the dumbest renderer.
No, I was thinking about the implications of mixing Nagari consonants
with kana diacritics (or the contrary); or circling (U+20DD) around
Indian conjuncts, or else around superscript digits; or the Tibetan
subjoined below Latin letters (how do they attach?); or Jamos followed
by a virama or a Telugu length mark. Etc.
My point was that it is *not* a good idea to render an out-of-context
Telugu length mark (U+0C55), when it follows for example a Latin vowel,
as a macron, even if this is the "logical" behaviour. Such code will be,
IMHO, just a waste.


 If it takes megabytes of code to do [that] there is probably something
 else wrong.

I do not count a dumb implementation as "decent".

And yes, I was overemphasizing with "megabytes". The OT support in FreeType,
which does only a small part of this task, is only 315 Kbytes of C code.
So I expect the not-so-dumb renderer based on it to be around 0.5 megabyte,
which does not take into account the code embedded in the OT fonts themselves.

As a result, yes, please remove the "s".

 
Antoine



RE: Devanagari question

2000-11-14 Thread Ayers, Mike


 From: D.V. Henkel-Wallace [mailto:[EMAIL PROTECTED]]


 At 06:30 2000-11-14 -0800, Marco Cimarosti wrote:

 But my point was: not even Mr. Ethnologue himself knows exactly *which*
 combinations are meaningful, in all orthographic systems.  And, clearly,
 no one can figure out which combinations may become meaningful in the
 *future* -- e.g. when a previously unwritten language gets its
 orthography, or when the spelling of an already written language gets
 changed.

 Sadly, it seems unlikely that any future change or adoption of orthography
 will use characters not already supported by the then major computer
 systems.  In fact the trend seems to be the other way, viz Spain's
 changing of its collation rules.

I do not think that this is a trend.  The last I knew,
computer-savvy Taiwan and Hong Kong were continuing to invent new
characters.  In the end, the onus is on the computer to support the user.
Only during the current frenzy of computerization is the reverse permitted -
this will pass.

 For a minority language (which all remaining unwritten languages are) the
 pressure will be strong to use existing combinations (since they won't
 constitute a large enough community for people to write special rendering
 support).

That depends on how you look at it.  From what I understand (which I
freely admit I have learned only from this list), Indic languages tend to be
supported in toto, and therefore even the currently unwritten ones will
belong to a highly non-minority language family.


$.02,

/|/|ike



RE: Devanagari question

2000-11-14 Thread Rick McGowan

Mike Ayers wrote:

 The last I knew,
 computer-savvy Taiwan and Hong Kong were continuing to invent new
 characters.  In the end, the onus is on the computer to support the user.

Yes, the computer should support the user, but... The invention of new characters to 
serve multitudes is OK, and international standards will probably continue to support 
that.  But I don't think it's reasonable or appropriate to keep inventing new 
characters willy-nilly for individuals (as reported), and then expect them to be added 
to an international standard.  That's silly.  The onus is not on international 
standards to support the whimsical production of novel, rarely-used, or nonce 
characters of the type reported to be generated.

In any case, I still have never seen actual documentary evidence that would prove to 
me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a 
hat.  People just keep saying that to scare everyone.  Sounds like an urban myth to me.

Rick

 


RE: Devanagari question

2000-11-14 Thread Thomas Chan

On Tue, 14 Nov 2000, Rick McGowan wrote:

 Mike Ayers wrote:
  The last I knew,
  computer-savvy Taiwan and Hong Kong were continuing to invent new
  characters.  In the end, the onus is on the computer to support the user.
 
 Yes, the computer should support the user, but... The invention of new characters to 
serve multitudes is OK, and international standards will probably continue to support 
that.  But I don't think it's reasonable or appropriate to keep inventing new 
characters willy-nilly for individuals (as reported), and then expect them to be 
added to an international standard.  That's silly.  The onus is not on international 
standards to support the whimsical production of novel, rarely-used, or nonce 
characters of the type reported to be generated.
 In any case, I still have never seen actual documentary evidence that would prove to 
me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a 
hat.  People just keep saying that to scare everyone.  Sounds like an urban myth to 
me.

I think there is some confusion between "new characters" in the sense that
they were never available in any standard, but which are taken from
pre-existing print sources, and now people would like to properly add
them; versus "new characters" that were made up "yesterday" for frivolous
reasons.


Thomas Chan
[EMAIL PROTECTED]





RE: Devanagari question

2000-11-13 Thread Marco Cimarosti

Antoine Leca wrote:
 My understanding is that there are a number of similar cases,
 which are not
 officially prohibited (AFAIK), but does not carry any sense.
 For example, how about digits followed by accents (as
 combining marks)?
 Or the kana voicing/voiceless combining marks, when they
 follow anything other than hiragana or katakana?

I think that the original idea behind having combining marks in Unicode was
that *any* combination of base + diacritic should be permitted, and be
handled decently by rendering engines.

The reason for this is that there are thousands of languages in the world,
and their orthographies may require an uncommon usage of diacritics. E.g.,
talking about katakana, I think that the orthography of the Ainu language
requires some combinations of syllable + voiced sign (or was it syllable +
semivoiced sign) that do not exist in Japanese.

If font designers and rendering engine implementers insist on the idea that
an "accented letter" may be rendered only if an ad-hoc glyph has been
anticipated in the font, many minority languages will never have a chance of
being supported at a reasonable cost.

This is not to say that fonts should *not* have precomposed glyphs.
Precomposed glyphs are useful in *some* cases, for providing a *better*,
*nicer* rendering for *some* troublesome combinations.

Less common combinations, used in less known languages, may get along with a
less-than-perfect rendering -- but *no* rendering at all is not acceptable,
IMHO!

Sorry for stating the obvious, but I see that actual implementations often
have an attitude towards precomposed glyphs that I don't see a reason for.

_ Marco

__
La mia e-mail è ora: My e-mail is now:
   marco.cimarostiªeurope.com   
(Cambiare "ª" in "@")  (Change "ª" to "@")
 




Re: Devanagari question

2000-11-13 Thread Mark Davis

The Unicode Standard does define the rendering of such combinations, which
is in the absence of any other information to stack outwards.
Implementations that can't do that will either overstrike, or use some other
fallback rendering.

A sophisticated rendering will use positioning such as control point
matching to get optimal positioning. A dumb implementation would simply move
the accent outwards if there was already one in the same position. This will not
necessarily produce an optimal positioning, but should be readable.
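
(For what a "dumb" outward stack could look like in code, here is a minimal
Python sketch; it is not anyone's actual engine, just a count of how far each
mark sits from the base:)

    import unicodedata

    def stack_levels(cluster):
        # Marks whose canonical combining class is "above" (230) stack upward,
        # "below" (220) stack downward; each successive mark of the same kind
        # is pushed one level further out from the base character.
        above = below = 0
        levels = []
        for ch in cluster:
            ccc = unicodedata.combining(ch)
            if ccc == 230:
                above += 1
                levels.append((ch, above))
            elif ccc == 220:
                below += 1
                levels.append((ch, -below))
            else:
                levels.append((ch, 0))
        return levels

    # acute and diaeresis stack above the base, the dot-below goes underneath
    print(stack_levels("a\u0301\u0308\u0323"))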

It may take a non-trivial amount of code to do the former (especially if it
means adding control point hinting, as in TrueType). If it takes megabytes of
code to do the latter there is probably something else wrong.

Mark
- Original Message -
From: "Antoine Leca" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, November 13, 2000 10:11
Subject: Re: Devanagari question


 Marco Cimarosti wrote:
 
  Antoine Leca wrote:
   My understanding is that there are a number of similar cases,
   which are not
   officially prohibited (AFAIK), but does not carry any sense.
 
  I think that the original idea behind having combining marks in Unicode was
  that *any* combination of base + diacritic should be permitted,

 The fact that it is permitted (as I said, they "are not prohibited")
 does not per se give them any sense...
 This was my point, but I was not clear enough.

  and be handled decently by rendering engines.

 The question here is the meaning of "decently".

 I beg your pardon, but as the programmer of a rendering engine, I cannot
 agree that I should spend hours and days, and furthermore add megabytes
 of code, to render "decently" combinations like digits + accents (by
 decently, I mean I should check whether the glyph for the digit has an
 ascender above x-height, or is of narrower width, and then adjust the
 position of the diacritic accordingly; similarly, adjusting the descender
 position of the Nagari virama according to the descender depth of a
 preceding "g" or "j" or "y".)

 On the contrary, I believe that when a combination is not expected, the
 renderer should have a very basic and straightforward behaviour, and just
 "print" the default glyphs in order, with overstriking when the second glyph
 is a combining mark. Doing something more complex, in addition to being,
 IMHO, a complete waste of time for both the programmer and the users (who
 have to load unused code), is also likely to give some users the idea that
 such weird combinations are handled this ("clever") way everywhere, thus
 leading to chaos when the data are brought elsewhere.


  If font designers and rendering engine implementers insist on the idea that
  an "accented letter" may be rendered only if an ad-hoc glyph has been
  anticipated in the font, many minority languages will never have a chance
  of being supported at a reasonable cost.

 I never say (nor I hope I implied) such an idea.

 Now, insisting that any renderer should properly align any diacritic on the
 top (or bottom) middle of the I, M and W glyphs will have the net result
 that nobody will ever be able to create any renderer...


  Less common combinations, used in less known languages, may get along with
  a less-than-perfect rendering -- but *no* rendering at all is not
  acceptable,

 Where anyone stated such an idea?


 Antoine




RE: Devanagari Consonant RA Rule R2

2000-11-09 Thread James E. Agenbroad

On Wed, 8 Nov 2000, Apurva Joshi wrote:

 The RA[sup] is seen applied to the independent vowel Vocalic R (U+ 090B) in
 printed samples in Sanskrit.
 
 There are at least the following words that contain the above:
 NaiRiTa (the name of a demon)
 = 0928 090B Ra[sup] 0924
 NaiRiTi (the goddess Durga, slayer of demons)
 = 0928 090B Ra[sup] 0924 0940
 NaiRiTYa (south-west)
 = 0928 090B Ra[sup] 0924 094D 092F 
 
 The Devanagari shaping engine in Uniscribe currently recognises a 0930 094D
 preceding only consonants, to be duely reordered to the end of the syllable
 and replaced with Ra[sup]. Whether this be extended to independent vowels
 had figured in internal discussions when the shaping engine was being
 planned. To the best of my knowledge, extending this to be applicable to
 Vocalic R would be a special case, because Ra[sup] is not seen to be applied
 to any other Indic vowel in words that are native to Indic languages.  
 
 Would be glad to hear from any expert on this list, if there are
 phonemes/sounds in any language, which when transliterated into Devanagari,
 would require the Ra[sup] to be applied to an independent vowel. 
 eg. vowel E Ra[sup] etc.
 
 Thanks,
 -apurva
 
 -Original Message-
 From: Eric Mader/Cupertino/IBM [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, November 08, 2000 10:24 AM
 To: Unicode List
 Subject: Devanagari Consonant RA Rule R2
 
 
 Hello,
 
 In the Devanagari section of the standard, rule R2, on page 217 of the
 version 3.0 standard, states, "If the dead consonant RA[d] precedes either a
 consonant *or an independent vowel,* then it is replaced by the superscript
 nonspacing mark RA[sup]..."
 
 I've never seen a RA[sup] applied to an independent vowel, and none of the
 software I can find that renders Devanagari does this; they all render a
 dead RA followed by the vowel. Is the rule in error, or is it written to
 cover some obscure case that most software doesn't bother with?
 
 Eric Mader
 
 
Wednesday, November 8, 2000
First, I'm not an expert in Sanskrit but have done some work with
Devanagari.  I think at figure 9-3 (4) on page 214 and at R2 on page 217 
Unicode 3.0 overstates and misstates the situation a bit.  What is being
described is, I believe, a rendering issue, not an encoding issue.  
Instead of involving an independent vowel, it involves the r consonant,
U+0930, immediately followed by the R vowel sign (matra), U+0943, which
happens to get rendered as the independent vowel, U+090B with the
superscript R, reph, above it--with no halant between the consonant
and the vowel sign.  On page 24 of Hester Lambert's Introduction to
Devanagari, "The vowel sign of [U+090B] is not written with [U+0930,
094D]. The character representing [U+0930, 094D] with [U+090B] is written
with the superscribed stroke used to represent [U+0930, 094D] when it is to
be realized before another consonant character without an intervening
vowel [i.e. reph]. This stroke is placed over the vowel character
[U+090B], as in [U+0928, 093F, 090B, reph, 0924, 093F] nirrti."  The order
of filing 'nirri' (dot under second r) in Monier Williams Sanskrit-English
dictionary (page 554, column 2) tends to confirm this interpretation: it
has, after nirUha: nirri, nirrich and nirrij (with a dot below the second
r), followed by nire.  It is possible that this peculiar rendering practice
would extend to the RA followed by U+0944, 0962 or 0963 but they seem to
me too unlikely to dwell on.  I suppose (by analogy to having two ways to
encode many letters with diacritics) Unicode could allow two ways to
encode what looks like "R vowel with reph"; at present it describes the
one with a halant but is silent about the display when the r consonant is
immediately followed by the r matra, U+0930, 0943. 
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




