Re: Devanagari and Subscript and Superscript
I missed this yesterday. Plug Gulp wrote:

> General support for all characters, words and sentences could be achieved by just three new formatting characters, e.g. SCR, SUP and SUB, similar to the way other formatting characters such as ZWS, ZWJ, ZWNJ etc are defined. The new formatting characters could be defined as:
>
> SCR: In a character stream, all the characters following this formatting character shall be treated as [...]
>
> SUP: In a character stream, all the characters following this formatting character shall be treated as [...]
>
> SUB: In a character stream, all the characters following this formatting character shall be treated as [...]

This isn't similar to ZWSP or ZWJ or ZWNJ. Those formatting characters are not stateful; they affect the rendering of, at most, the single characters immediately preceding and following them. The ones you suggest are stateful; they affect the rendering of arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48 ("ANSI") attribute switching, or ISO 2022 character-set switching. Unicode tries hard to avoid encoding such things.

-- Doug Ewell | http://ewellic.org | Thornton, CO
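Doug's locality point can be made concrete. A joiner such as ZWNJ occupies a single position in the character stream and influences only the join it sits in, so a renderer never has to carry state beyond the current cluster. A minimal Python sketch (the code points are the real ZWNJ and Devanagari characters; the comments describe typical shaping behavior):

```python
# ZWNJ (U+200C) affects only the single join it sits in: here it
# suppresses the k.ssa conjunct while leaving the rest of the text alone.
KA, VIRAMA, SSA, ZWNJ = "\u0915", "\u094D", "\u0937", "\u200C"

conjoined = KA + VIRAMA + SSA          # typically renders as the kssa ligature
disjoined = KA + VIRAMA + ZWNJ + SSA   # typically renders as half-ka + ssa

# Removing the ZWNJ restores the original sequence exactly; no other
# characters in the stream are affected, i.e. the control is stateless.
assert disjoined.replace(ZWNJ, "") == conjoined
```

A stateful SUP-style control, by contrast, would change the interpretation of every character up to the next marker, however far away that is.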
Re: Devanagari and Subscript and Superscript
2015-12-16 19:16 GMT+01:00 Doug Ewell:

> The ones you suggest are stateful; they affect the rendering of arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48 ("ANSI") attribute switching, or ISO 2022 character-set switching. Unicode tries hard to avoid encoding such things.

You can try as hard as you want: there are cases where it is impossible to avoid stateful encoding if we want to avoid disunifications, and even some characters that cannot work at all without stateful analysis. And this is not solved just by style markup when that "style" is in fact completely semantic. The situation must be considered with more care:

- For example, the superscript Latin letter o, aka "ordinal masculine", is not just a superscript but a notation adding the semantics of an abbreviation of the final letters, linked to the other letters before it, the whole being semantically a single word. The superscript style does not create such an attachment; it creates a separate "word" inside it, so it was disunified from the letter o.

- But it is not good practice to encode in Unicode things that are just styles without clear semantics (so encoding SUB/SUP is really a bad idea).

- Conversely, it is simply impossible to work with Egyptian hieroglyphs, as the default clusters are clearly insufficient to create ANY kind of plain text: you need extra markup to add the necessary semantics, not style, and this markup should be encodable as plain text, without external presentational markup, when the presentation is fully semantic and clear (e.g. the Egyptian "cartouche" for names of kings).

- Similar issues occur with SignWriting and other scripts that DO always require a complex (non-linear) layout, where basic clusters are clearly insufficient in ALL texts, meaning that the characters that were encoded are almost **useless** in all plain-text documents: you need extra "format" characters to create some form of orthographic rule, independently of the style or of an external markup language.

I'm in favor of adding **semantic** format characters in Unicode, not stylistic-only format characters, as soon as there exists a well-known orthographic convention that would work independently of styling. But for now the encoded format characters only work on too-small clusters, clusters are only linear, and this is clearly not enough (even for instructing other kinds of text analysis, such as breakers). Renderers would then be adapted and extended to work with more complex clusters, with internal structures built from simpler cluster parts. Other renderers using the legacy rules will not be able to do that, but will attempt to render some basic fallback (possibly with special visible glyphs for those controls).

One kind of semantic format character which is useful and encoded is the "invisible parentheses" for mathematics, which can be encoded for example after a radical sign: use them around a number to define the extension of the radical to more than one digit, making a clear visual and semantic distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render any parentheses, or between "sqrt(2+sqrt(3))" and "sqrt(2)+sqrt(3)".
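The sqrt(24) vs. sqrt(2)4 distinction above can be sketched with a toy scope parser. Unicode does encode some invisible mathematical operators (e.g. U+2062 INVISIBLE TIMES), but the OPEN/CLOSE grouping pair below is purely hypothetical, borrowed from the Private Use Area for illustration:

```python
# Hypothetical invisible grouping marks; NOT real Unicode characters.
# Two Private Use Area code points stand in for them.
OPEN, CLOSE = "\uE000", "\uE001"
SQRT = "\u221A"  # the real SQUARE ROOT sign

def radical_scope(text: str, i: int) -> str:
    """Return the substring governed by a radical sign at position i,
    assuming the radicand may be wrapped in the invisible OPEN/CLOSE
    pair; with no pair, only the single next character is covered."""
    if i + 1 < len(text) and text[i + 1] == OPEN:
        end = text.index(CLOSE, i + 1)
        return text[i + 2:end]
    return text[i + 1:i + 2]

assert radical_scope(SQRT + OPEN + "24" + CLOSE, 0) == "24"  # sqrt(24)
assert radical_scope(SQRT + "24", 0) == "2"                  # sqrt(2)·4
```

The point is that such characters carry pure semantics (extent of an operator), not style, which is what distinguishes them from the SUB/SUP proposal.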
Re: Devanagari and Subscript and Superscript
Plug Gulp wrote: > It will help if Unicode standard itself intrinsically supports > generalised subscript/superscript text. This falls outside the scope of "plain text" as defined by Unicode, in much the same way as bold and italic styles and colors and font faces and sizes. There are several rich-text formats besides HTML that support arbitrary subscript and superscript text. PDF and Word leap to mind. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Devanagari and Subscript and Superscript
Does the standard support the use of diacritics in plain text format, when used with all and any complex scripts?

Regards
Sinnathurai

> On 15 December 2015 at 17:46 Doug Ewell wrote:
>
> Plug Gulp wrote:
>
> > It will help if Unicode standard itself intrinsically supports generalised subscript/superscript text.
>
> This falls outside the scope of "plain text" as defined by Unicode, in much the same way as bold and italic styles and colors and font faces and sizes.
>
> There are several rich-text formats besides HTML that support arbitrary subscript and superscript text. PDF and Word leap to mind.
>
> -- Doug Ewell | http://ewellic.org | Thornton, CO
RE: Devanagari and Subscript and Superscript
srivas sinnathurai wrote: > Does the standard support the use of diacritics in plain text format, > when used with all and any complex scripts? It probably depends on what you mean by "support" and "diacritics." I can type a Tamil letter followed by a combining acute accent or diaeresis, and in Arial Unicode MS it actually looks halfway decent. Many years ago, William Overington famously put a combining circumflex on top of U+2604 COMET. You just type one character followed by another and hope for the best, display-wise. You don't get any other special behavior. I'm not sure if this was supposed to be a comment on my statement that arbitrary subscript and superscript is similar to other attributes that are not defined to be part of plain text. -- Doug Ewell | http://ewellic.org | Thornton, CO
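The "type and hope" behavior is easy to observe with Python's unicodedata module: normalization will not invent precomposed forms for arbitrary base + mark pairs, so such sequences remain two code points and their appearance is left entirely to the font:

```python
import unicodedata

# A combining mark can follow any base character; whether it renders
# well is up to the font. NFC only composes pairs that have a
# precomposed code point, so these stay two-character sequences.
tamil_ta_diaeresis = "\u0BA4\u0308"  # TAMIL LETTER TA + COMBINING DIAERESIS
comet_circumflex = "\u2604\u0302"    # COMET + COMBINING CIRCUMFLEX ACCENT

for seq in (tamil_ta_diaeresis, comet_circumflex):
    assert len(unicodedata.normalize("NFC", seq)) == 2

# Compare: e + combining acute does have a precomposed form.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"
```

No special behavior attaches to the first two sequences beyond default mark placement; that is all that "support" means here.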
Re: Devanagari and Subscript and Superscript
On Wed, Dec 9, 2015 at 5:18 AM, Martin J. Dürst wrote:

> I suggest using HTML: ब<sup>क्ष</sup>

This will work only if the end-users are always going to use a web browser to view the text content. It will help if the Unicode standard itself intrinsically supports generalised subscript/superscript text. I think the meaning of the text should be contained within the text itself rather than relying on external text markers and viewers. That way the text-content creator does not have to rely on what type of Unicode-compliant text viewer or editor the end user is using. The text should retain its meaning irrespective of the type of Unicode-compliant text viewer or editor used. Similarly, if the text has to be saved in a database without losing its meaning, then either it has to be saved with all the known markers of all the available editors, or some special processing needs to be incorporated to convert some saved marker to the markers of the various available text viewers and editors. Having generalised Unicode support for superscript and subscript will solve all these problems.

Following is one of the use-cases where general Unicode support for superscript/subscript will help tremendously: A math teacher (गणिताचे शिक्षक) in a Marathi (मराठी) language school is writing notes, in her Unicode-compliant plain text editor, to explain mathematical terms to her students. Following is an excerpt from the notes that explains terms such as exponents (घातांक) and base (पाया). (English translation is given below):

"जेव्हा एखाद्या संखेचा स्वतःशीच अनेक वेळा गुणाकार होतो तेव्हा त्या गुणाकाराला थोडक्यात लिहिण्याच्या पद्धतीला घातांक असे म्हणतात. उदाहरणार्थ, ५ ही संख्या जर स्वतःशी ३ वेळा गुणली जात असेल, म्हणजे ५ x ५ x ५, तर त्याला घातांक पद्धतीत ५^३ असे लिहितात. ह्या घातांकीय रचनेला "५ चा ३ रा घात" असे म्हणतात. आपण अजून एक उदाहरण घेऊया, "२ ना चा १० वा घात", म्हणजे २ ही संख्या स्वतःशी १० वेळा गुणली गेली आहे. ह्याला आपण २^१० असे लिहितो.
तर साधारणपणे, कूठलीही संख्या ब जेव्हा स्वतःशी क्ष वेळा गुणलीजाते तेव्हा त्याला घातांक पद्धतीत ब^क्ष असे लिहितात, आणि त्या रचनेला "ब चा क्ष वा घात" असे म्हणतात. इथे ब ह्या संखेला पाया म्हणतात आणि क्ष ह्या संखेला घात असे म्हणतात. तर थोडक्यात, घातांकीय रचनेला पाया^घात असे लिहितात."

English translation: "Exponent is a shorthand notation that denotes a multiplication of a number by itself a number of times. For example, if a number 5 is multiplied by itself 3 times, i.e. 5 x 5 x 5, then it is represented in exponential form as 5^3. This exponential term is referred to as "5 raised to the power of 3". Let us consider another example, "2 raised to the power of 10", i.e. 2 is multiplied by itself 10 times. This is written in exponential form as 2^10. So, in general, any number b that is multiplied by itself k number of times is written as b^k, and the term is referred to as "b raised to the power of k". The number b is called the base, and the number k is called the exponent. In short, an exponential term is written as base^exponent."

Please note that the teacher had to use a Circumflex Accent (Caret) to indicate superscript, which is an unwritten convention, in the absence of proper superscript support within Unicode. To make the text available to a wider audience and still retain its meaning, the teacher will have to rely partly on Unicode support, partly on the markers available in the various text viewers of her students, partly on the markers available in the text editors of the peer-reviewers of her text, and partly on the unwritten convention (such as the caret). This conundrum can be resolved only if there is generalised support for superscript and subscript within the Unicode standard. The standard already has a section for superscript and subscript. Generalising and extending this support will help other languages and scripts. General support for all characters, words and sentences could be achieved by just three new formatting characters, e.g.
SCR, SUP and SUB, similar to the way other formatting characters such as ZWS, ZWJ, ZWNJ etc. are defined. The new formatting characters could be defined as:

SCR: In a character stream, all the characters following this formatting character shall be treated as normal text until either the end of the character stream or the next SUP or SUB character is reached. This shall be the default marker, i.e. if no marker is specified then the text shall be treated as normal text until either the end of the character stream or the next SUP or SUB character is reached.

SUP: In a character stream, all the characters following this formatting character shall be treated as superscript text until either the end of the character stream or the next SCR or SUB character is reached.

SUB: In a character stream, all the characters following this formatting character shall be treated as subscript text until either the end of the character stream or the next SCR or SUP character is reached.

A general support within Unicode for subscripting and superscripting
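For concreteness, here is a sketch of what a renderer would have to do with the proposed markers. The three code points are hypothetical (borrowed here from the Private Use Area, since no SCR/SUP/SUB controls are actually encoded), and the parser illustrates the statefulness objection raised elsewhere in the thread: the current style must be carried across arbitrarily long runs of text:

```python
# Hypothetical code points for the proposed SCR / SUP / SUB controls.
SCR, SUP, SUB = "\uE000", "\uE001", "\uE002"
MARKERS = {SCR: "normal", SUP: "superscript", SUB: "subscript"}

def split_runs(stream: str):
    """Split a character stream into (style, text) runs following the
    proposed semantics: each marker switches the style for all
    subsequent characters until the next marker or end of stream."""
    runs, style, buf = [], "normal", []
    for ch in stream:
        if ch in MARKERS:
            if buf:
                runs.append((style, "".join(buf)))
            style, buf = MARKERS[ch], []
        else:
            buf.append(ch)
    if buf:
        runs.append((style, "".join(buf)))
    return runs

# "b raised to the power ksha": base, then a superscripted cluster.
assert split_runs("\u092C" + SUP + "\u0915\u094D\u0937") == [
    ("normal", "\u092C"),
    ("superscript", "\u0915\u094D\u0937"),
]
```

Note that every process handling such text (searching, line breaking, truncation) would need this state machine, not just renderers.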
Re: Devanagari and Subscript and Superscript
On Tue, 15 Dec 2015 18:00:16 + (GMT) srivas sinnathurai wrote:

> Does the standard support the use of diacritics in plain text format, when used with all and any complex scripts?

Relatively few scalar value sequences are prohibited - just possibly sequences containing unassigned characters that are not non-characters, but I can't think of any others. (The prohibition on unpaired surrogates applies to coded character sequences, but surrogate characters aren't scalar values.)

It would appear by Conformance Requirement C5, 'A process shall not assume that it is required to interpret any particular coded character sequence', that a process is at liberty to decline to interpret a sequence of scalar values, even if it has just interpreted it. I am not aware of any requirements in the standard to interpret specific character sequences.

In general, the interpretation of character sequences is undefined. For example, a request for advice on the interpretation of the combination of U+0331 COMBINING MACRON BELOW and U+0E39 THAI CHARACTER SARA UU was answered with the instruction to consult the non-existent typographical tradition. It's been left to rendering engine writers to define the interpretation.

Indeed, I am not sure that every sequence of defined scalar values has an interpretation. Most pairs of regional indicators don't have an interpretation, and the interpretation of each variation sequence may change at least twice: once when the base character becomes defined (or is defined not to be a possible base character), and again when the variation sequence is assigned an interpretation as an ill-defined (or grossly ill-defined) family of glyphs. Do U+0337 COMBINING SHORT SOLIDUS OVERLAY and U+20E5 COMBINING REVERSE SOLIDUS OVERLAY have a defined interpretation when their base character is to be represented by a mirrored glyph? Note that, in general, the Unicode standard does not define when a character is to be represented by a mirrored glyph. This may be defined by a lower-level protocol (the font file).

Richard.
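The parenthetical about surrogates not being scalar values can be demonstrated in Python, whose str type tolerates a lone surrogate in memory but refuses to serialize it as well-formed UTF-8 (a sketch of CPython 3 behavior):

```python
# A lone surrogate can exist inside a Python str, but it is not a
# Unicode scalar value, so it cannot appear in well-formed UTF-8.
lone = "\ud800"
try:
    lone.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# An unassigned-but-valid scalar value, by contrast, encodes fine:
# being unassigned does not make a sequence ill-formed.
assert "\U0003FFFD".encode("utf-8")
```

This mirrors the distinction in the message: well-formedness constrains code unit sequences, while interpretation of valid scalar-value sequences is largely left open.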
Re: Devanagari and Subscript and Superscript
On Tue, Dec 15, 2015 at 11:55:02AM +, Plug Gulp wrote:

> Please note that the teacher had to use a Circumflex Accent (Caret) to indicate superscript, which is an unwritten convention, in the absence of proper superscript support within Unicode.

If the teacher is explaining actual math to her students, then the superscript is the least of her worries. Math typesetting is two-dimensional, and is so much more complex than regular formatted text (let alone regular plain text) that it needs its own typesetting engines. There are various plain-text markup languages for math, if one really wants to represent complex mathematical notation in plain text.

Regards,
Khaled
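Unicode does already encode superscript forms for a handful of characters, so the caret convention can be mapped to real code points in those few cases. The sketch below (plain Python; the helper name is invented for illustration) also shows why this does not generalise: there are, for example, no superscript Devanagari digits to map to.

```python
# Map ASCII digits to the superscript digits Unicode already encodes
# (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079).
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079")

def caret_to_superscript(expr: str) -> str:
    """Rewrite a base^exponent expression using existing superscript
    digits. Only works for characters that have superscript forms."""
    base, _, exp = expr.partition("^")
    return base + exp.translate(SUPERSCRIPTS)

assert caret_to_superscript("5^3") == "5\u00B3"    # 5³
assert caret_to_superscript("2^10") == "2\u00B9\u2070"  # 2¹⁰

# Devanagari digits pass through unchanged: no superscript forms exist,
# which is exactly the gap the original poster is describing.
assert caret_to_superscript("\u096B^\u0969") == "\u096B\u0969"
```

This is a workaround for display, not a general mechanism; it cannot express superscripted conjuncts like क्ष at all.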
Re: Devanagari and Subscript and Superscript
On Wed, 9 Dec 2015 03:24:39 + Plug Gulp wrote:

> I am trying to understand if there is a way to use Devanagari characters (and grapheme clusters) as subscript and/or superscript in unicode text.

Why do you want to do this? Are you asking about writing Devanagari vertically rather than horizontally? If that is what you want, you should be looking at mark-up such as is found in cascading style sheets (CSS). It is an important issue for CJK and Mongolian, and there have been questions as to what is needed for Indian scripts. (There's also an antiquarian interest for historical scripts, such as Phags-pa and even Egyptian - moves are afoot to support the hieroglyphic script as plain text.)

Richard.
Re: Devanagari and Subscript and Superscript
On Wed, 9 Dec 2015 03:24:39 + Plug Gulp wrote:

> Hi,
>
> I am trying to understand if there is a way to use Devanagari characters (and grapheme clusters) as subscript and/or superscript in unicode text.

The view is that such would not be 'plain text', and therefore need not be catered for in Unicode. On the other hand, the desire for spacing raised and lowered characters is sufficient that markup to produce them is widely available, as Martin Dürst pointed out. Non-spacing stacked characters are not common enough for general support to be available.

In many Indic scripts, stacking is the normal arrangement, and is supplied via a script-specific special character that is overloaded with a vowel cancellation symbol. However, font-specific deviations from vertical stacking are arranged, and vowel marks are treated independently. There is no provision for vertical stacks to have horizontal offshoots. (Scripts written vertically are a different case.)

For characters stacked directly above and below, not in the normal modern fashion of writing words, there can be special characters for special cases. For example, there are U+A8EE COMBINING DEVANAGARI LETTER PA in the Devanagari Extended block and U+0364 COMBINING LATIN SMALL LETTER E. Other, clumsier scheme-specific techniques are available in other cases. See for example the writing of nuclides with an explicit atomic number in https://en.wikipedia.org/wiki/Nuclide. The notation needs a mass number at top left and an atomic number at bottom left.

A fairly general case is the annotation of kanji known as 'ruby'. Sometimes an application or mark-up scheme will support this directly.

Richard.
Re: Devanagari and Subscript and Superscript
Hello Plug,

I suggest using HTML: ब<sup>क्ष</sup>

Regards, Martin.

On 2015/12/09 12:24, Plug Gulp wrote:

> Hi,
>
> I am trying to understand if there is a way to use Devanagari characters (and grapheme clusters) as subscript and/or superscript in unicode text. It will help if someone could please direct me to any document that explains how to achieve that. Is there a unicode marker that will treat the next grapheme cluster in the unicode text as super/subscript? For e.g. if one wants to represent "ब raise to क्ष" how does one achieve that; is there a marker to represent it as follows: ब + SUP + क + ् + ष, where SUP acts as a marker for superscripting the next grapheme cluster. Similar for subscripting.
>
> Sorry if this is not the right place to ask this question; in that case please could you direct me to the right forum?
>
> Thanks and kind regards
> ~Plug
RE: Devanagari Letter Short A
The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91. Neither was it encoded in any of the earlier versions of ISCII. Hence according to the ISCII standard this character simply cannot be formed. Aparna A. Kulkarni -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ernest Cline Sent: Monday, February 16, 2004 10:59 AM To: Unicode List Subject: Devanagari Letter Short A I've been trying to make sense of the Indian scripts, but am having one small difficulty. I can't seem to find the ISCII 1991 equivalent for U+0904 (DEVANAGARI LETTER SHORT A). Is this a character that is part of the set accessed by the extended code (xF0) or was this part of the ISCII 1988 standard that did not survive the changes to ISCII 1991? Alternatively, does ISCII encode this as xA4 + xE0 as this would seem to generate the proper glyph even tho it violates the syllable grammar given in Section 8 of ISCII? Or even more alternatively, am I just missing something that should be obvious, but which for some reason I can't see? Even with the slight differences in the naming conventions between ISCII and Unicode, I don't seem to be misplacing any of the other vowels or consonants. Ernest Cline [EMAIL PROTECTED]
Re: Devanagari Letter Short A
From: Aparna A. Kulkarni [EMAIL PROTECTED] To: [EMAIL PROTECTED]; 'Unicode List' [EMAIL PROTECTED] Sent: Thursday, February 19, 2004 8:23 AM Subject: RE: Devanagari Letter Short A

> The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91. Neither was it encoded in any of the earlier versions of ISCII. Hence according to the ISCII standard this character simply cannot be formed.
>
> Aparna A. Kulkarni

So could this character exist only for the purpose of supporting languages that are not covered by ISCII but that share the same Devanagari script, and is it then needed for countries other than India? (I am thinking here of Dravidian transcriptions.) If there's no ISCII standard related to its meaning or encoding, then what is invalid about coding it with LETTER A followed by the LETTER SHORT E vowel modifier, possibly with an intermediate INV or other ISCII-compatible control? How would this break ISCII compatibility? Aren't there existing practices to represent LETTER SHORT A in ISCII?
Re: Devanagari Letter Short A
Philippe Verdy wrote:

> U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an independent vowel. It can be viewed as a conjunct of the independent vowel U+0905 DEVANAGARI LETTER A and the dependent vowel sign U+0946 DEVANAGARI VOWEL SIGN SHORT E (noted for transcribing Dravidian vowels in the Unicode charts).

You may regard it this way, but that is not so. U+0905 followed by U+0946 is really U+090E. Compare with the other scripts to understand why.

> I don't know why this is not documented, because I can find various sources that use U+0904 or U+0905,U+0946 which have exactly the same rendering and probably the same meaning and usage.

Wow! You have various sources that use a character added to Unicode about two and a half years ago! Impressive!

About the rendering of U+0905,U+0946: since it violates the usual rules, it is up to your system. Mine does not render it properly, though (unless I cheat).

> I think that U+0946 was added in ISCII 1991 but was absent from ISCII 1988

No. It was there even in ISCII 83.

> (I think it's too late to define it: ISCII 1988 has been used consistently before,

Hmm... I have really no evidence that ISCII 1988 was used at all... Would be happy to find one, though...

Antoine
Re: Devanagari Letter Short A
Ernest Cline wrote: I've been trying to make sense of the Indian scripts, but am having one small difficulty. I can't seem to find the ISCII 1991 equivalent for U+0904 (DEVANAGARI LETTER SHORT A). I do not believe you'll find it there. U+0904 had been added to Unicode for version 4.0. In 2001. URL:http://www.unicode.org/consortium/utc-minutes/UTC-089-200111.html Search for 89-C19. Is this a character that is part of the set accessed by the extended code (xF0) or was this part of the ISCII 1988 standard that did not survive the changes to ISCII 1991? No and no. Alternatively, does ISCII encode this as xA4 + xE0 as this would seem to generate the proper glyph even tho it violates the syllable grammar given in Section 8 of ISCII? It does not. At the very least, if you want to generate this character in ISCII this way, try A4 DB E0 (using INV). This is an ugly hack, of course. As an aside, in some version of ISCII (EA-ISCII, notably), A4 E0 is supposed to be equivalent to AD. This is the way the alphabet is sometimes taught to children in India. Antoine
Re: Devanagari Letter Short A
My understanding of the Indian scripts coded in Unicode is that the mapping from ISCII to Unicode is not a straightforward one-to-one mapping, because ISCII uses a contextual encoding for characters (allowing shifts between several scripts) and some rich-text features. The ISCII character model is not exactly the same as the Unicode character model, even though there was an attempt to make this mapping as simple as possible by allocating the Unicode code points for each individual ISCII-supported script in the same relative order, leaving gaps in the Unicode-encoded scripts for ISCII characters that are not used in one specific script. The good reference for how Indian scripts are coded in Unicode is Chapter 9 of the Unicode 4 reference: http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf

In summary, the Unicode model for Devanagari:

- uses consonantal letters with an implied (default) vowel A, modified by the next coded dependent vowel sign (matra) that creates graphic conjuncts with the consonant, or

- uses half-forms of consonants to drop the implied vowel in initial consonants, or

- uses a virama (halant) U+094D to mark other omissions of the implied vowel on dead consonant letters (most often on final consonants, but this occurs as well on initial or medial consonants), by removing the final stem of the full (live) consonant that is normally used to depict a phonetic syllable boundary with a necessary vowel. The virama thus allows creating conjuncts with other following dead or live consonants, and normally attaches both consonant letters into the same syllable or conjunct.

- In some cases, the omission of the implied dependent vowel must not create a ligated conjunct, so the virama still needs to represent the omission of the vowel without creating a conjunct that would break the perceived phonetics, and a ZWNJ is used between the dead consonant (consonant letter + virama) and the next live consonant.
There's a U+0905 pseudo-consonant /a/ which is used in the absence of a phonetic consonant, but it follows the same encoding rule as other consonant letters /*a/, i.e. coding another isolated vowel requires coding /a/ before the vowel sign (matra). This encodes approximately the same thing as isolated vowels, except that the intended rendering is different.

U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an independent vowel. It can be viewed as a conjunct of the independent vowel U+0905 DEVANAGARI LETTER A and the dependent vowel sign U+0946 DEVANAGARI VOWEL SIGN SHORT E (noted for transcribing Dravidian vowels in the Unicode charts). I don't know why this is not documented, because I can find various sources that use U+0904 or U+0905,U+0946, which have exactly the same rendering and probably the same meaning and usage. I think that U+0946 was added in ISCII 1991 but was absent from ISCII 1988 (verify, I don't have the ISCII 1988 reference document), so U+0904 has survived just to allow a mostly one-to-one mapping with ISCII 1988. But maybe I'm wrong here about the addition of U+0946, and there are reasons for this choice. There's no canonical or compatibility equivalence defined between U+0904 and U+0905,U+0946 (I think it's too late to define one: ISCII 1988 has been used consistently before, and the Unicode stability policy now forbids defining new equivalences between them).

- Original Message - From: Ernest Cline [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Sent: Monday, February 16, 2004 6:28 AM Subject: Devanagari Letter Short A

> I've been trying to make sense of the Indian scripts, but am having one small difficulty. I can't seem to find the ISCII 1991 equivalent for U+0904 (DEVANAGARI LETTER SHORT A). Is this a character that is part of the set accessed by the extended code (xF0) or was this part of the ISCII 1988 standard that did not survive the changes to ISCII 1991?
Alternatively, does ISCII encode this as xA4 + xE0 as this would seem to generate the proper glyph even tho it violates the syllable grammar given in Section 8 of ISCII? Or even more alternatively, am I just missing something that should be obvious, but which for some reason I can't see? Even with the slight differences in the naming conventions between ISCII and Unicode, I don't seem to be misplacing any of the other vowels or consonants. Ernest Cline [EMAIL PROTECTED]
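The code-point level of the model summarized above can be sketched in Python; the rendering comments describe typical shaping behavior, which is the job of the font and shaping engine rather than of these code points themselves:

```python
KA, YA = "\u0915", "\u092F"     # consonant letters with implied /a/
VIRAMA = "\u094D"                # halant: cancels the implied vowel
MATRA_I = "\u093F"               # dependent vowel sign /i/
ZWNJ = "\u200C"

ka = KA                          # /ka/: implied vowel, nothing more to code
ki = KA + MATRA_I                # /ki/: matra replaces the implied /a/
kya = KA + VIRAMA + YA           # /kya/: dead ka typically joins following ya
k_ya = KA + VIRAMA + ZWNJ + YA   # same letters, conjunct rendering suppressed

# The virama is present in both spellings of /kya/; ZWNJ changes only
# the rendering preference, not the underlying letter sequence.
assert k_ya.replace(ZWNJ, "") == kya
assert len(ki) == 2
```

Everything beyond this (half-forms, ligature selection) is presentation: the backing store never records which visual form was chosen.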
Re: Devanagari Glottal Stop
I wrote:

> I would have to disagree with these Indian experts in this instance. The Devanagari glottal stop does not have a dot, and indeed, in the languages which use it, this character will certainly coexist with the question mark. They have different shapes, and different functions.

At 15:03 -0800 2003-04-05, Mark Davis wrote:

> Can you respond back to them with the information as to the languages involved?

I believe they read the Unicore list, don't they, Mark? N2543 and 02/394 show the character used for the Limbu language, and show the glyph without a dot and with a horizontal headbar, which the question mark never has. (It also shows an example where, because the typesetters didn't have the letter available, they substituted a question mark, but that just goes to show that we need to encode this, because it is a letter, not a punctuation mark.)

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Devanagari Glottal Stop
I would have to disagree with these Indian experts in this instance. The Devanagari glottal stop does not have a dot, and indeed, in the languages which use it, this character will certainly coexist with the question mark. They have different shapes, and different functions. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Devanagari Glottal Stop
Can you respond back to them with the information as to the languages involved? Mark ( ) [EMAIL PROTECTED] IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193 (408) 256-3148 fax: (408) 256-0799 - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, April 05, 2003 01:45 Subject: Re: Devanagari Glottal Stop I would have to disagree with these Indian experts in this instance. The Devanagari glottal stop does not have a dot, and indeed, in the languages which use it, this character will certainly coexist with the question mark. They have different shapes, and different functions. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Devanagari
Vipul Garg wrote: I have downloaded your font chart for Devanagari, which is in the range from 0900 to 097F. I have also installed the Arial Unicode font supplied by Microsoft office XP suite. I found that not all characters are available for Devanagari. For example letters such as Aadha KA, Aadha KHA, Aadha GA etc. are not available. These letters are required in the devanagari words such as KANYA, NANHA, PARMATMA etc. If you could provide the above letters then our requirement for formation of Devanagari words would be possible. This requirement is very crucial as we have a large volume project on Devanagari language involving data storage in Oracle database. You could try using a different font, for example one of the specialist Devanagari fonts listed at: http://www.alanwood.net/unicode/fonts.html#devanagari Alan Wood http://www.alanwood.net (Unicode, special characters, pesticide names)
RE: Devanagari
Vipul Garg was asking why half characters were not included in the Unicode code charts and in his copy of the Arial Unicode font. More recent versions of Arial Unicode do contain half characters etc. for Devanagari. As to the code charts, to answer this you need to explore the Unicode web site a bit more. Please see the following for detailed information regarding the half characters etc.:

http://www.unicode.org/unicode/standard/where/
http://www.unicode.org/unicode/faq/indic.html
http://www.unicode.org/unicode/uni2book/ch09.pdf

Best Regards
Andy

You Wrote:

> I have downloaded your font chart for Devanagari, which is in the range from 0900 to 097F. I have also installed the Arial Unicode font supplied by Microsoft office XP suite. I found that not all characters are available for Devanagari. For example letters such as Aadha KA, Aadha KHA, Aadha GA etc. are not available. These letters are required in the devanagari words such as KANYA, NANHA, PARMATMA etc.
RE: Devanagari
Vipul Garg wrote:

> I have downloaded your font chart for Devanagari, which is in the range from 0900 to 097F. I have also installed the Arial Unicode font supplied by Microsoft office XP suite. I found that not all characters are available for Devanagari. For example letters such as Aadha KA, Aadha KHA, Aadha GA etc. are not available. These letters are required in the devanagari words such as KANYA, NANHA, PARMATMA etc. If you could provide the above letters then our requirement for formation of Devanagari words would be possible. This requirement is very crucial as we have a large volume project on Devanagari language involving data storage in Oracle database. Would appreciate an early reply.

Please see the document "Where is my character?": http://www.unicode.org/unicode/standard/where/

Also have a look at question 17 in the Indic FAQ: http://www.unicode.org/unicode/faq/indic.html#17

All this is explained in more detail in Section 9.1, Devanagari, of the Unicode manual: http://www.unicode.org/unicode/uni2book/ch09.pdf

Regards. M.C.
Re: Devanagari
[EMAIL PROTECTED] scripsit: Au contraire! You might find the attached gif of interest. (This is version 1.0 of the font. Some people might have earlier versions.) Ah, excellent. It has not always been so. If you're not getting Indic shaping with Arial Unicode MS, it's very likely the fault of your software, not the font (and, of course, not Unicode). Indeed, but the original poster specified the use of XP (Windows or Office, I forget which), so I discounted that. -- They do not preach John Cowan that their God will rouse them[EMAIL PROTECTED] A little before the nuts work loose.http://www.ccil.org/~cowan They do not teach http://www.reutershealth.com that His Pity allows them --Rudyard Kipling, to drop their job when they damn-well choose. The Sons of Martha
RE: Devanagari
I came across the 'Where is my character?' page and read that there is a combination of keystrokes to represent the Indic half forms, such as KA and Halant combining to form half KA. There is also a list of other letter representations through combinations of Devanagari letters. Please email me the list for my ready reference. Best Regards, Vipul Garg Mind Axis (I) Solutions Pvt. Ltd. Phone: +91 (22) 55994860 / 61 -Original Message- From: John Cowan [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 03, 2002 5:33 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Devanagari Vipul Garg scripsit: I have downloaded your font chart for Devanagari, which is in the range from 0900 to 097F. I have also installed the Arial Unicode font supplied by the Microsoft Office XP suite. I found that not all characters are available for Devanagari. For example, letters such as Aadha KA, Aadha KHA, Aadha GA etc. are not available. This is not a Unicode problem. Arial Unicode is not designed to handle Indic scripts; it does not contain the necessary ligatures and half forms. You need to use a more suitable font. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan One time I called in to the central system and started working on a big thick 'sed' and 'awk' heavy-duty data-bashing script. One of the geologists came by, looked over my shoulder and said 'Oh, that happens to me too. Try hanging up and phoning in again.' --Beverly Erlebacher
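The half-form mechanism the FAQ describes can be shown directly at the code-point level: half forms are not separately encoded, but arise from a consonant followed by U+094D VIRAMA (halant). A minimal Python sketch, spelling out the word KANYA from the earlier message:

```python
import unicodedata

# Half KA, half NA, etc. have no code points of their own: a consonant
# plus U+094D VIRAMA renders as the half form when another consonant
# follows. "KANYA" (with a half NA before YA) is encoded as:
KANYA = "\u0915\u0928\u094D\u092F\u093E"   # KA, NA, VIRAMA, YA, vowel sign AA

for ch in KANYA:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

A shaping-capable font and renderer will display the NA + VIRAMA pair as the half form of NA; the underlying storage never contains a "half letter" character.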
RE: Devanagari variations
On Fri, 8 Mar 2002, Marco Cimarosti wrote: Peter Constable wrote: On 03/07/2002 02:16:10 PM James E. Agenbroad wrote: A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. I may be wrong, but I believe that example has ra, halant, ra, independent i. The first ra is the one that transforms into the reph. You are wrong, in fact, sorry. Although figure 9-3 does not show code point values, both the glyphs and the abbreviated letter names make it clear that the sequence is: U+0930 (DEVANAGARI LETTER RA) U+094D (DEVANAGARI SIGN VIRAMA) U+090B (DEVANAGARI LETTER VOCALIC R) James' idea is that the same graphemes could have been better represented with the sequence: U+0930 (DEVANAGARI LETTER RA) U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R) It is an interesting idea, because ra never occurs with matra r., so there is no danger of confusion. But it is probably too late for changing it: it would break compatibility with ISCII and existing Unicode fonts. _ Marco Monday, March 11, 2002 ra as reph does occur with r.; cf. Monier Williams' Sanskrit-English Dictionary, page 554, second column, where between niru_ha and nire (using underscore for macron and for circumflex) are nirr.i and nirr.ich and nirr.ij. I believe ISCII is silent on this matter. If so, how can compatibility with it be broken? If fonts have this glyph, can't they allow two encodings to invoke it? I do not advocate deletion or deprecation of the encoding shown on page 214 of 3.0 for this glyph; I do advocate saying somewhere in the Unicode Standard discussion of Devanagari that there is another, more plausible and more Indian way to encode this glyph.
Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) It is not true that people stop pursuing their dreams because they grow old, they grow old because they stop pursuing their dreams. Adapted from a letter by Gabriel Garcia Marquez. The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A. Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
RE: Devanagari variations
Peter Constable wrote: On 03/07/2002 02:16:10 PM James E. Agenbroad wrote: A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. I may be wrong, but I believe that example has ra, halant, ra, independent i. The first ra is the one that transforms into the reph. You are wrong, in fact, sorry. Although figure 9-3 does not show code point values, both the glyphs and the abbreviated letter names make it clear that the sequence is: U+0930 (DEVANAGARI LETTER RA) U+094D (DEVANAGARI SIGN VIRAMA) U+090B (DEVANAGARI LETTER VOCALIC R) James' idea is that the same graphemes could have been better represented with the sequence: U+0930 (DEVANAGARI LETTER RA) U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R) It is an interesting idea, because ra never occurs with matra r., so there is no danger of confusion. But it is probably too late for changing it: it would break compatibility with ISCII and existing Unicode fonts. _ Marco
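For readers without the book at hand, the two candidate spellings Marco compares can be written out as code point sequences. A small Python sketch; note that the two sequences are not canonically equivalent, which is part of why changing the preferred encoding after the fact would be disruptive for searching and matching:

```python
import unicodedata

# Figure 9-3's encoding: RA + VIRAMA + independent LETTER VOCALIC R
encoded = "\u0930\u094D\u090B"
# The alternative under discussion: RA + dependent VOWEL SIGN VOCALIC R
proposed = "\u0930\u0943"

for label, seq in (("figure 9-3", encoded), ("alternative", proposed)):
    print(label, [unicodedata.name(c) for c in seq])

# Normalization does not unify the two spellings: text stored one way
# will not match a query using the other.
print(unicodedata.normalize("NFC", encoded) ==
      unicodedata.normalize("NFC", proposed))
```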
Re: Devanagari variations
At 15:36 -0600 07/03/2002, [EMAIL PROTECTED] wrote: I may be wrong, but I believe that example has ra, halant, ra, independent i . The first ra is the one that transforms into the reph. You're wrong. RI in this case is a way of writing the vocalic r. Compare Kr.s.n.a and Krishna. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari variations
At 15:16 -0500 07/03/2002, James E. Agenbroad wrote: On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote: On 03/06/2002 08:25:18 AM Michael Everson wrote: [snip] In Cham, independent vowels can take dependent vowel signs. In Devanagari, I guess that doesn't occur, but the Brahmic model shouldn't be understood to preclude this behaviour. [snip] - Peter A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. Is that the sample you showed me when I was a-photocopying at the Library of Congress in August, James? You're saying that RA + virama + INDEPENDENT VOCALIC R and RA + VOWEL SIGN VOCALIC R should both produce the same glyph? -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari variations
Using Apple's WorldText, I can confirm that short I did not reorder correctly when preceded by 0294. But the 0294 glyph was in another font. I wonder could we see some samples of this in actual Limbu text? -- Michael Everson *** Everson Typography *** http://www.evertype.com
RE: Devanagari variations
At 11:26 +0100 2002-03-08, Marco Cimarosti wrote: You are wrong, in fact, sorry. Although figure 9-3 does not show code point values, both the glyphs and the abbreviated letter names make it clear that the sequence is: U+0930 (DEVANAGARI LETTER RA) U+094D (DEVANAGARI SIGN VIRAMA) U+090B (DEVANAGARI LETTER VOCALIC R) James' idea is that the same graphemes could have been better represented with the sequence: U+0930 (DEVANAGARI LETTER RA) U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R) It is an interesting idea, because ra never occurs with matra r., so there is no danger of confusion. But it is probably too late for changing it: it would break compatibility with ISCII and existing Unicode fonts. Well, in Apple's WorldText version 1.1 I just typed both of these. The first one displayed as RA VIRAMA (visible) VOCALIC R and the second displayed as REPHA VOCALIC R. So in at least one implementation the latter is supported. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari variations
On 03/08/2002 06:54:54 AM Michael Everson wrote: Using Apple's WorldText, I can confirm that short I did not reorder correctly when preceded by 0294. But the 0294 glyph was in another font. I wonder could we see some samples of this in actual Limbu text? It's on its way. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Devanagari variations
On 03/08/2002 05:09:46 AM Michael Everson wrote: At 15:36 -0600 07/03/2002, [EMAIL PROTECTED] wrote: I may be wrong, but I believe that example has ra, halant, ra, independent i. The first ra is the one that transforms into the reph. You're wrong. RI in this case is a way of writing the vocalic r. Compare Kr.s.n.a and Krishna. I guess that's what I get for commenting on things beyond my ken. Mea culpa. - Peter
Re: Devanagari variations
Jim Agenbroad responded (off list): Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a RI vowel: RA(d) + RI(n) -- RI(n) + RA(sup) (parens in lieu of subscript). I didn't realise that RI meant the vocalic R. I mistook it to mean something else. I find it a weakness of that section that such notations are not defined and prominently displayed in an easy-to-find location. Thanks for setting me straight. I should have known you knew what you were talking about. Peter
Re: Devanagari variations
[EMAIL PROTECTED] scripsit: I didn't realise that RI meant the vocalic R. It reflects the modern Hindi pronunciation of Skt /r=/. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan "I amar prestar aen, han mathon ne nen, han mathon ne chae, a han noston ne 'wilith." --Galadriel, _LOTR:FOTR_
RE: Devanagari enthousiasm!
It appears that hindi.exe installs Uniscribe - which, AFAIK, is not permitted by Microsoft - so much for honouring license agreements! That's another reason why they'd package it as an EXE. - rick cameron -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Wednesday, 6 March 2002 12:14 To: Yaap Raaf Cc: [EMAIL PROTECTED] Subject: Re: Devanagari enthousiasm! On 06-03-2002 04:29:20 PM Yaap Raaf wrote: At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote: I am on a Mac and can't open it, Well, this is going to be a problem for non-Windows clients, I admit. it's a 244K .exe Why an .exe? I don't know if this is what the BBC was trying to do, but using an executable installer package is at least one way to make sure people see the license agreement... Bob
Re: Devanagari variations
At 10:29 -0600 2002-03-08, [EMAIL PROTECTED] wrote: Jim Agenbroad responded (off list): Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a RI vowel: RA(d) + RI(n) -- RI(n) + RA(sup) (parens in lieu of subscript). I didn't realise that RI meant the vocalic R. I mistook it to mean something else. I find it a weakness of that section that such notations are not defined and prominently displayed in an easy-to-find location. Actually, I would like to see that written R with dot below. We should use decent transliteration in those notations; why not? -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari variations
On Fri, 8 Mar 2002, Michael Everson wrote: At 15:16 -0500 07/03/2002, James E. Agenbroad wrote: On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote: On 03/06/2002 08:25:18 AM Michael Everson wrote: [snip] In Cham, independent vowels can take dependent vowel signs. In Devanagari, I guess that doesn't occur, but the Brahmic model shouldn't be understood to preclude this behaviour. [snip] - Peter A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. Is that the sample you showed me when I was a-photocopying at the Library of Congress in August, James? You're saying that RA + virama + INDEPENDENT VOCALIC R and RA + VOWEL SIGN VOCALIC R should both produce the same glyph? -- Michael Everson *** Everson Typography *** http://www.evertype.com Friday, March 8, 2002 Michael, Yes. [Call me Jim] Regards, Jim Agenbroad ( [EMAIL PROTECTED] )
Re: Devanagari variations
On Fri, 8 Mar 2002 [EMAIL PROTECTED] wrote: Jim Agenbroad responded (off list): Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a RI vowel: RA(d) + RI(n) -- RI(n) + RA(sup) (parens in lieu of subscript). I didn't realise that RI meant the vocalic R. I mistook it to mean something else. I find it a weakness of that section that such notations are not defined and prominently displayed in an easy-to-find location. Thanks for setting me straight. I should have known you knew what you were talking about. Peter Friday, March 8, 2002 Peter, I agree there is a weakness there. Maybe more than one. I have mailed you (Peter) the Deshpande and Monier Williams examples I cited. Have a nice weekend all! Regards, Jim Agenbroad ( [EMAIL PROTECTED] )
RE: Devanagari variations
implementations might not recognise a sequence like consonant, vowel, nukta as valid. For instance, I understand that if Uniscribe encountered such a sequence, it would assume you've left out a consonant immediately before the nukta, and it would display a dotted circle to indicate where a missing base character should go. That behaviour, IMHO, is incorrect. There is no, and never was, any kind of grapheme or even combining-sequence break at that point, and there should never be a dotted circle displayed through that sequence of characters (a show-individual-characters mode should of course be excepted). /kent k
Re: Devanagari enthousiasm!
On 03/06/2002 03:12:20 PM Michael Everson wrote: But a font is not an ISO/IEC 10646 subset! By definition, it contains glyph codes, not character codes. They are in two different worlds. But in public procurement a subset may be specified, in which case ASCII will be implied. I don't know who made up this rule, by the way. I've never seen any font vendor advertise the capabilities of their fonts in terms of ISO 10646 subsets. In fact, I've never seen any reference to ISO 10646 subsets outside discussions of ISO 10646 such as this. - Peter
RE: Devanagari variations
That behaviour, IMHO, is incorrect. There is no, and never was, any kind of grapheme or even combining-sequence break at that point, and there should never be a dotted circle displayed through that sequence of characters (a show-individual-characters mode should of course be excepted). I agree. What I'm hoping to find out is whether developers of various (whatever) software products can verify whether their code behaves correctly in this regard. This would likely be speaking to ICU, Java or other implementations of Devanagari rendering, or Devanagari fonts (OT or otherwise). - Peter
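The dotted-circle behaviour under discussion can be modelled with a toy validity check. This is a hedged sketch of the assumption Peter attributes to Uniscribe (that nukta may only follow a consonant), not Uniscribe's actual code; the consonant range is deliberately simplified:

```python
# Toy model of a strict cluster rule: a renderer that only accepts
# NUKTA (U+093C) immediately after a consonant will flag the sequence
# consonant + vowel sign + nukta, even though some lesser-known
# languages need nukta on vowel signs.
NUKTA = "\u093C"

def is_consonant(ch):
    # Simplified: Devanagari KA..HA only
    return "\u0915" <= ch <= "\u0939"

def strict_engine_accepts(seq):
    """Return False where this strict model would insert a dotted circle."""
    for prev, ch in zip(seq, seq[1:]):
        if ch == NUKTA and not is_consonant(prev):
            return False
    return True

print(strict_engine_accepts("\u0915\u093C"))        # KA + nukta
print(strict_engine_accepts("\u0915\u093E\u093C"))  # KA + vowel sign AA + nukta
```

Kent's point is that the second sequence should be rendered as one combining sequence, not rejected; the fix is in the validity assumption, not in the encoding.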
Re: Devanagari variations
On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote: On 03/06/2002 08:25:18 AM Michael Everson wrote: [snip] In Cham, independent vowels can take dependent vowel signs. In Devanagari, I guess that doesn't occur, but the Brahmic model shouldn't be understood to preclude this behaviour. [snip] - Peter A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. Didn't Mark Twain say he didn't think much of a person who could spell a word in only one way? Regards, Jim Agenbroad ( [EMAIL PROTECTED] )
Re: Devanagari variations
On 03/07/2002 02:16:10 PM James E. Agenbroad wrote: A similar but not the same situation is found in the fourth example in figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the reph (an abridged form of the consonant 'ra') above it. Unicode wants this encoded as consonant + halant + independent vowel. I believe it is better considered as a consonant + vowel sign combination which happens to have an odd display, and at least one Sanskrit textbook agrees. I may be wrong, but I believe that example has ra, halant, ra, independent i. The first ra is the one that transforms into the reph. - Peter
Re: Devanagari variations
I have gotten the answer to the question Michael raised about the glottal stop: it does *not* have an inherent vowel. So, given that, I return to the original question: quote The question is whether there is any problem using U+0294, and whether proposing a Devanagari-specific character would be a better option. One particular problem I think would be likely to occur is that rendering engines such as Uniscribe, or whatever is coded into host environments like Java for Hindi support, would not be able to cope with U+0294 occurring in the midst of a Devanagari sequence. E.g. I could easily imagine something like Uniscribe failing to reorder U+093F before a glottal U+0294. Should we try to educate and convince implementers of the need to allow U+0294 to be reckoned as part of the Devanagari script, or should we propose a new Devanagari glottal character? (I guess on the principle of unifying across languages but not across scripts, it could be argued that a new character should be proposed.) /quote - Peter - Forwarded by Peter Constable/IntlAdmin/WCT on 03/07/2002 10:10 PM - Jeff LB Webster [EMAIL PROTECTED] 03/07/2002 08:59 PM To: [EMAIL PROTECTED] cc: Subject: Re: Devanagari variations Peter, I responded to Steve also, but the short answer is NO, the glottal has no inherent vowel. Jeff
Re: Devanagari enthousiasm!
On 06-03-2002 09:59:48 Yaap Raaf wrote: Win98: You need something called Opentype Devanagri fonts to VIEW the Hindi unicode text. You can get a good font for free from BBC Hindi site. Except that the license that accompanies the font says: COPYRIGHT AND ALL OTHER RIGHT, TITLE AND INTEREST IN THE SOFTWARE BELONGS TO THE BBC, ITS LICENSORS AND SUPPLIERS. THE BBC IS LICENSED BY ITS LICENSORS TO USE AND DISTRIBUTE THE SOFTWARE ON ITS WEBSITE WWW.BBC.CO.UK (THE WEBSITE) AND TO PERMIT YOU TO DOWNLOAD IT, COPY IT ON TO YOUR PERSONAL COMPUTER AND USE IT FOR THE PURPOSES OF ACCESSING, VIEWING AND MAKING, FOR YOUR OWN PERSONAL USE ONLY, ONE PRINT COPY OF THE HINDI LANGUAGE VERSION OF THE WEBSITE. OTHER THAN AS PROVIDED FOR IN THIS SECTION, YOU MAY NOT MAKE ANY USE WHATSOEVER OF THE SOFTWARE OR THE HINDIFONT. I interpret this to mean one may not legitimately use this font for any purpose other than viewing the BBC website. Bob
Re: fj ligature [Re: Devanagari variations]
On Wednesday, March 6, 2002, at 03:24 AM, Herman Ranes wrote: There is a related problem in connection with Norwegian typography: Most fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a commercial font which includes the 'fj' ligature. Apple's Hoefler font contains an fj ligature. If I'm not mistaken, most of Adobe's Pro fonts do, too. == John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
Re: Devanagari variations
At 00:12 -0600 2002-06-03, [EMAIL PROTECTED] wrote: (1) The first problem is the need for a glottal character for Limbu (ie, the Limbu language written in Devanagari script, as opposed to Limbu script, which already has a symbol for glottal). The Limbu language committee has decided that this character should be represented using what looks pretty much like the IPA glottal symbol (U+0294), though in a Devanagari font it would have to be designed to match Devanagari characters. I see this in version 2 of the Nepali White Paper http://www.cicc.or.jp/english/hyoujyunka/mlit3/7-7-2.pdf The question is whether there is any problem using U+0294, and whether proposing a Devanagari-specific character would be a better option. One particular problem I think would be likely to occur is that rendering engines such as Uniscribe, or whatever is coded into host environments like Java for Hindi support, would not be able to cope with U+0294 occurring in the midst of a Devanagari sequence. E.g. I could easily imagine something like Uniscribe failing to reorder U+093F before a glottal U+0294. That almost answers my first question. Does the Devanagari glottal have an inherent vowel? If it does, encode a new character. (2) The second problem involves nukta (U+093C). In better-known languages, nukta can occur only on consonants, but for certain lesser-known languages, it can occur on vowels as well. Yet some implementations might not recognise a sequence like consonant, vowel, nukta as valid. For instance, I understand that if Uniscribe encountered such a sequence, it would assume you've left out a consonant immediately before the nukta, and it would display a dotted circle to indicate where a missing base character should go. So what would you suggest? A vocalic nukta? I wouldn't like that. In Cham, independent vowels can take dependent vowel signs. In Devanagari, I guess that doesn't occur, but the Brahmic model shouldn't be understood to preclude this behaviour.
Our people in South Asia have told me the nukta can occur on vowels in the range U+093e..U+094c, though my contact has told me that he himself has only seen this on 093E, 0940, 0941 and 094B. Um, that's AA, II, U, and O. What does the nukta make them sound like? -- Michael Everson *** Everson Typography *** http://www.evertype.com
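The run-segmentation worry behind the U+0294 question can be sketched as follows. This is an illustrative toy, not any real shaper: it shows how an engine that splits text into script runs by code block would isolate the glottal and the following matra into separate runs, so U+093F VOWEL SIGN I would never reorder before the glottal:

```python
# Toy script-run segmenter. A real shaper uses the Unicode Script
# property and smarter run merging; this block-range check is a
# simplifying assumption for illustration only.
def toy_script(ch):
    cp = ord(ch)
    if 0x0900 <= cp <= 0x097F:
        return "Devanagari"
    return "Other"   # U+0294 LATIN LETTER GLOTTAL STOP lands here

def runs(text):
    out = []
    for ch in text:
        s = toy_script(ch)
        if out and out[-1][0] == s:
            out[-1] = (s, out[-1][1] + ch)
        else:
            out.append((s, ch))
    return out

# KA, glottal stop, VOWEL SIGN I: the matra ends up in a separate
# run from its base, so pre-base reordering cannot happen.
print(runs("\u0915\u0294\u093F"))
```

Encoding a Devanagari-specific glottal character (inside the 0900 block) would sidestep the problem; teaching shapers to treat U+0294 as part of a Devanagari run is the alternative Peter raises.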
Re: fj ligature [Re: Devanagari variations]
* Herman Ranes [EMAIL PROTECTED] [2002-03-06 11:24]: There is a related problem in connection with Norwegian typography: Most fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a commercial font which includes the 'fj' ligature. From the Adobe OpenType user guide: (http://www.adobe.com/type/browser/pdfs/OTGuide.pdf) # Many Adobe Pro fonts include a large set of standard ligatures, such as # fi, fl, ffi, ffl, ff, fj, ffj, Th, and others. Most other Adobe OpenType # fonts have a smaller set of standard ligatures, like those in Type 1 # fonts. So, you can find some, at least from Adobe, and I guess other quality font vendors. -- * [EMAIL PROTECTED]
Re: Devanagari enthousiasm!
At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote: I interpret this to mean one may not legitimately use this font for any purpose other than viewing the BBC website. If http://www.bbc.co.uk/hindi/images/download_text.gif is any indication, the font doesn't look too promising. Have you seen it? I am on a Mac and can't open it; it's a 244K .exe. Why an .exe? BTW, the menus and headers at http://www.bbc.co.uk/hindi/ come out as gibberish on my screen; the articles are fine. (MacOS 9.0) The great enthusiasm in the message I forwarded was what struck me, and then, several times the word 'STANDARD' in capitals! People have to see results before they believe, and spread the word. There was another message announcing the Raghu font. Subject: Free Unicode Hindi fonts From: Dakshin Shantakumar [EMAIL PROTECTED] Newsgroups: alt.language.hindi soc.culture.indian Date: 2 Mar 2002 13:51:45 -0800 Downloadable here http://www.ncst.ernet.in/~matra/hindi_display.shtml It has about 600 glyphs. But no Latin letters, which, IIRC, disqualifies it as a real Unicode font? Some of the glyphs are unfamiliar to me. Yaap -- attachment: Raghu_glyphs.jpg
Re: Devanagari enthousiasm!
At 17:29 +0100 2002-06-03, Yaap Raaf wrote: There was another message announcing Raghu font. Subject: Free Unicode Hindi fonts From:Dakshin Shantakumar [EMAIL PROTECTED] Newsgroups: alt.language.hindi soc.culture.indian Date:2 Mar 2002 13:51:45 -0800 Downloadable here http://www.ncst.ernet.in/~matra/hindi_display.shtml It has about 600 glyphs. But no Latin letters, which, IIRC, disqualifies it as a real Unicode font? Some of the glyphs are are unfamiliar to me. The first one is RA + VIRAMA + VOCALIC R, it looks to me. That would be, in principle, a valid but rare syllable. Maybe it's Vedic. The second one, I'm not sure what to think of. It looks like NGA + RA + NUKTA + VOCALIC R, and I couldn't begin to wonder what it's supposed to represent. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: fj ligature [Re: Devanagari variations]
On 03/06/2002 04:24:54 AM Herman Ranes wrote: There is a related problem in connection with Norwegian typography: Most fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a commercial font which includes the 'fj' ligature. That's quite a different problem. All it would take to support that in Uniscribe/OpenType would be to add the appropriate ligature mapping in a GSUB OpenType table; nothing needs to change in Uniscribe. But for cons, vow, nukta, Uniscribe will insert U+25CC as a UI device to let the user know that something is wrong with the data, and that's based on the assumption that nukta can only go on a consonant. To solve your problem, you just need to educate type designers of the need. It's relatively simple to solve. - Peter
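The GSUB ligature mechanism Peter refers to amounts to a sequence-to-glyph substitution table consulted by the shaper. A toy Python model of a 'liga' lookup (glyph names are illustrative, not taken from any real font):

```python
# Toy model of an OpenType GSUB 'liga' lookup: replace a matching
# glyph sequence with a single ligature glyph. Supporting 'fj' is
# just one more entry in the table; the shaping engine is unchanged.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("f", "j"): "f_j",   # the ligature most fonts omit
}

def apply_ligatures(glyphs):
    out, i = [], 0
    while i < len(glyphs):
        for n in (3, 2):   # try the longest match first
            seq = tuple(glyphs[i:i + n])
            if seq in LIGATURES:
                out.append(LIGATURES[seq])
                i += n
                break
        else:
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(list("fjerde")))   # ['f_j', 'e', 'r', 'd', 'e']
```

This is why the fj case is a font issue rather than an encoding one: the text still contains the characters f and j, and only the glyph stream changes.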
Re: fj ligature [Re: Devanagari variations]
At 02:24 3/6/2002, Herman Ranes wrote: There is a related problem in connection with Norwegian typography: Most fonts include the 'fi' and 'ffi' ligatures, but I have never heard of a commercial font which includes the 'fj' ligature. Using such a font, the word 'fire' (four) would be ligated correctly, while 'fjerde' (fourth) would not. And exactly what does the rendering of the 'international' loan-word /fjord/ look like in printed matter around the world? I regularly find it unligated in English and German reference works which have in other respects virtually perfect typography. With the increased support of OpenType layout, I think you will see more fonts supplying an fj ligature. The Adobe Pro fonts contain this ligature, as do any Latin script fonts I have produced for clients over the past three years. Most type designers I know are aware of the need for this glyph, but until now they have not had a standard and reliable way to make it available. This is not really the same issue as Peter has raised with regard to Limbu, where the issue is less the availability of a particular glyph than the handling of a character by shaping engines that might fail to identify it as part of the surrounding text. The former is a typographic and font issue, while the latter is a text processing issue and may be an encoding issue. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] ... es ist ein unwiederbringliches Bild der Vergangenheit, das mit jeder Gegenwart zu verschwinden droht, die sich nicht in ihm gemeint erkannte. ... every image of the past that is not recognized by the present as one of its own concerns threatens to disappear irretrievably. Walter Benjamin
Re: Devanagari enthousiasm!
At 08:29 3/6/2002, Yaap Raaf wrote: There was another message announcing the Raghu font. Subject: Free Unicode Hindi fonts From: Dakshin Shantakumar [EMAIL PROTECTED] Newsgroups: alt.language.hindi soc.culture.indian Date: 2 Mar 2002 13:51:45 -0800 Downloadable here http://www.ncst.ernet.in/~matra/hindi_display.shtml It has about 600 glyphs. But no Latin letters, which, IIRC, disqualifies it as a real Unicode font? No, a Unicode font does not need to contain Latin letters. There are issues regarding using such fonts in Windows 9x and ME, because these systems require 8-bit codepage support and there are no MS codepages for Indic scripts. This is why MS Mangal, the Hindi UI font that ships with Windows 2000 and XP, is not licensed for use on older versions of the OS. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED]
Re: Devanagari enthousiasm!
At 11:03 -0800 2002-06-03, John Hudson wrote: It has about 600 glyphs. But no Latin letters, which, IIRC, disqualifies it as a real Unicode font? No, a Unicode font does not need to contain Latin letters. A valid ISO/IEC 10646 subset must contain ASCII. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari enthousiasm!
At 11:03 -0800 2002-06-03, John Hudson wrote: No, a Unicode font does not need to contain Latin letters. And Michael Everson responded: A valid ISO/IEC 10646 subset must contain ASCII. But a font is not an ISO/IEC 10646 subset! By definition, it contains glyph codes, not character codes. They are in two different worlds. So it's still true that a font compatible with Unicode need not contain Latin letters. Rick
Re: Devanagari enthousiasm!
On 06-03-2002 04:29:20 PM Yaap Raaf wrote: At 14:02 +0100 2002.03.06, [EMAIL PROTECTED] wrote: I am on a Mac and can't open it, Well, this is going to be a problem for non-Windows clients, I admit. it's a 244K .exe Why an .exe? I don't know if this is what the BBC was trying to do, but using an executable installer package is at least one way to make sure people see the license agreement... Bob
Re: Devanagari enthousiasm!
Michael Everson scripsit: A valid ISO/IEC 10646 subset must contain ASCII. But a 10646 subset is a coded character set, not a font. -- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] To say that Bilbo's breath was taken away is no description at all. There are no words left to express his staggerment, since Men changed the language that they learned of elves in the days when all the world was wonderful. --_The Hobbit_
Re: Devanagari enthousiasm!
At 12:07 -0800 2002-06-03, Rick McGowan wrote: At 11:03 -0800 2002-06-03, John Hudson wrote: No, a Unicode font does not need to contain Latin letters. And Michael Everson responded: A valid ISO/IEC 10646 subset must contain ASCII. But a font is not a ISO/IEC 10646 subset! By definition, it contains glyph codes, not character codes. They are in two different worlds. But in public procurement a subset may be specified, in which case ASCII will be implied. I don't know who made up this rule, by the way. So it's still true that a font compatible with Unicode need not contain Latin letters. OK. Caveat emptor. -- Michael Everson *** Everson Typography *** http://www.evertype.com
10646 subsets (was: Re: Devanagari enthousiasm!)
Michael Everson said: No, a Unicode font does not need to contain Latin letters. A valid ISO/IEC 10646 subset must contain ASCII. Besides others pointing out the obvious disconnect between 10646 subsets and what can be in a valid Unicode font (which contains glyphs, not characters), this statement is not correct even in its proper context. To cite chapter and verse, 10646 defines two kinds of subsets: Limited subsets (clause 12.1) are simply enumerations of any list of code points (code positions, in 10646-speak). There are no constraints on this kind of subset, so it could consist merely of a list of Hebrew combining marks, for example. Selected subsets (clause 12.2) consist of lists of collections from Annex A. It is *selected* subsets which automatically contain U+0020..U+007E. And note that it is only *those* code points which are included, and not ASCII -- which would also imply inclusion of U+0000..U+001F and U+007F. --Ken
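Ken's arithmetic can be checked directly: the collection mandated for selected subsets, U+0020..U+007E, covers 95 code points, while ASCII proper covers 128, so 33 code points (the C0 controls plus DEL) are not implied. A minimal check:

```python
# Code points mandated for 10646 "selected" subsets: U+0020..U+007E inclusive.
mandated = range(0x20, 0x7F)            # range end is exclusive, so 0x7E is included
ascii_full = range(0x00, 0x80)          # ASCII proper: U+0000..U+007F

print(len(mandated))                    # 95 graphic characters (space through tilde)
print(len(ascii_full) - len(mandated))  # 33 code points NOT implied: 32 C0 controls + DEL
```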
Re: Devanagari variations
On 03/06/2002 08:25:18 AM Michael Everson wrote: That almost answers my first question. Does Devanagari glottal have an inherent vowel? If it does, encode a new character. That seems like a very good metric to consider, and I hadn't thought of it myself. I'd expect that this can be used syllable-initially rather than only finally, and so would have an inherent vowel, but I don't know that for certain. I've asked my contacts working in S. Asia for further info. (2) The second problem involves nukta (U+093C). In better-known languages, nukta can occur only on consonants, but for certain lesser-known languages, it can occur on vowels as well. Yet some implementations might not recognise a sequence like consonant, vowel, nukta as valid. For instance, I understand that if Uniscribe encountered such a sequence, it would assume you've left out a consonant immediately before the nukta, and it would display a dotted circle to indicate where a missing base character should go. So what would you suggest? A vocalic-nukta? I wouldn't like that. No, I wouldn't suggest anything different. The question is mainly intended to find out to what extent implementers are making assumptions that would present problems. In Cham, independent vowels can take dependent vowel signs. In Devanagari, I guess that doesn't occur, but the Brahmic model shouldn't be understood to preclude this behaviour. There's a general problem: writing systems of lesser-known languages sometimes involve behaviours that don't occur in the writing systems of better-known languages, but software implementations get designed based upon what is known, meaning the better-known writing systems only, and sometimes implementations incorporate constraints based upon what is exemplified in those better-known writing systems. E.g. 
there are Mon-Khmer languages spoken in Thailand that get written with Thai script but have many more vowel distinctions than Thai and so need to use combinations of combining marks not used in combination for Standard Thai, yet some important software implementations incorporate sequence constraints that treat these combinations as error conditions. Um, that's AA, II, U, and O. What does the nukta make them sound like? I haven't any idea, myself. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Devanagari keyboard for WindowsXP
On 02/23/2002 08:58:28 PM yaapraaf wrote: Now I have one more question, related to the following: # Subject: Re: How to create Unicode input methods for MacOS? (long) # Our 'uchr' resources are created using an assembler. It's the # only tool we are aware of that can fill in the offsets the 'uchr' # data structure contains. We don't have a custom 'uchr' editing # tool at this point (we would love to have one...). If such a comparison could be made, does the Keyman wizard represent the kind of 'intelligence' of the assembler, or is there a fundamental difference between Windows and Mac systems that makes keyboard editing on Windows less difficult than it is for the Mac in the case of 'uchr'? What's the difference? I don't know these things in detail; when I need to know the details I go to my co-worker Jonathan Kew. I have some ideas, though. As I understand it, there are some significant differences. First, my understanding is that the Mac uchr resources (and their predecessor, kchr resources) are tables compiled in a binary format that map from scan codes to characters, usually in a 1:1 manner, though I know kchr could support dead keys -- i.e. a one-level subtable, so that if a key designated to be a deadkey was pressed, then the following keystroke used a different set of lookups. If I recall, kchr resources were part of a script bundle, meaning that each kchr resource had a script code, and there could be at most one kchr resource per script code. I seem to recall that there were 32 possible script codes, but I'd really need to check the docs to make sure. (You can find this stuff on the Apple site if you know where to dig.) I don't know what happens in these regards with uchr resources. My understanding of these kinds of low-level details is also imperfect, but a little better than for the Mac. (When I really need to know details in this area that I'm rusty on, I ask Marc Durdin, the author of Keyman.) 
From what I understand, Windows is somewhat different, and adding Keyman into the mix makes it even more different. Windows uses individual files -- .kbd on Win9x/Me and .dll on NT/2K/XP -- that contain mapping tables to map scan codes into virtual-key codes and character codes. There can be any number of these on a system, but they have to be associated with a LANGID to use them. Win32 allows lots of LANGIDs -- way more than the number of script codes allowed in QuickDraw. Win32 also allows multiple kbds/dlls to be associated with a given LANGID, except that there's something in Win9x/Me that keeps this from working (I don't know if it's just a UI problem or something at a lower level). I don't know exactly what kinds of mappings can be created in a kbd/dll, but it wouldn't surprise me to learn that it's pretty much comparable to what could be done in a KCHR resource. Now, Keyman is really an entirely different mechanism. It doesn't create a distinct kbd/dll, but makes use of one on the system. It will intercept what is generated by the system, and use its own mechanisms to map to character codes. And its mechanisms are *far* richer than 1:1 mappings or 1:1 mappings plus deadkeys. This is so because it was first created for use with Win3.x to create presentation form-encoded data involving scripts of SE Asia. In other words, it needed to handle some of the kinds of transformations that are handled today by Uniscribe and OpenType. You can think of Keyman's wizard as similar to an assembler, except that what it generates is not processor assembly code but rather rules in the Keyman keyboard description language. For instance, if I use the Keyman wizard and drag the shape for a Devanagari KA (U+0915) onto the K key in the screen representation of a keyboard, then it will generate the rule + 'k' > U+0915 in the KMN text file that constitutes the programming code for the input method being created. 
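The two mapping models Peter describes -- a plain 1:1 keystroke table versus a table with dead keys -- can be sketched in a few lines. This is an illustrative model only (the table contents and function name are hypothetical, not Keyman's or Apple's actual formats): a dead key switches the next keystroke to a secondary lookup, while an ordinary key maps directly, like the Keyman-style rule just mentioned.

```python
# Hypothetical sketch of keystroke-to-character mapping with one dead key.
BASE = {"k": "\u0915", "g": "\u0917"}          # e.g. k -> DEVANAGARI LETTER KA
DEAD = {"`": {"a": "\u00E0", "e": "\u00E8"}}   # ` is a dead key: ` then a -> à

def map_keys(keystrokes):
    out, pending = [], None
    for key in keystrokes:
        if pending is not None:                # previous key was a dead key:
            out.append(DEAD[pending].get(key, key))  # consult its sub-table once
            pending = None
        elif key in DEAD:                      # remember the dead key, emit nothing yet
            pending = key
        else:
            out.append(BASE.get(key, key))     # plain 1:1 lookup
    return "".join(out)

print(map_keys("k"))    # क  (direct mapping)
print(map_keys("`a"))   # à  (dead-key sequence)
```

Keyman's rule language goes well beyond this (context-sensitive rules, reordering), which is why it suited presentation-form encodings.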
After you have used the wizard, you can revise the KMN program in whatever way you want, just as if you hadn't used the wizard but were writing the behaviour by hand -- though you wouldn't need to do this if your input method only requires simple, context-free keystroke-to-character mappings. It may be clear I don't know an assembler from a wizard, but I'm just amazed about the 'problem' at Apple and the (seemingly) simple solution for Windows in this regard. Several years ago, Jonathan Kew created the SILKey program, which is basically Keyman for the Mac. (There was a misunderstanding about some details in the Keyman documentation that resulted in some slight differences between Keyman's description language and that used by SILKey, but eventually both programs had been revved so that a single description could be written to be used on both platforms.) SILKey is available from the SIL web site. But, it has *not* been updated to support Unicode. There are some issues that would be involved in doing so, and since there aren't too many Unicode apps
Re: Devanagari keyboard for WindowsXP
On 02/22/2002 05:26:53 PM Yaap Raaf wrote: I've also been looking at Tavultesoft's Keyman, but there are no readymade keyboards available for the purpose. I don't know how complicated it is to develop one. For simple behaviours, it can be quite easy; e.g. if you just need to assign Devanagari characters to keys on a US keyboard without any additional behaviour considerations, there's a wizard with a visual UI that you can use. If you need, though, you can create fairly sophisticated input methods. How difficult it is depends on how complex the required behaviour is, but the learning curve is **far** less than learning to program in C or to use the Windows DDK. And I've never heard of anyone complaining that a Keyman input method could not keep up with a user's typing speed. BTW, Keyman 6 (in development) will support Microsoft's Text Services Framework, and also (I think) rules that are sensitive to both preceding and following context. That should make it possible to create fairly polished text-editing user interfaces even for Indic scripts or Arabic, where there are complex character/presentation relationships involved. (E.g. I'd guess that Arabic users would find it less distracting if characters appeared initially in non-final forms and only change to final forms after a word-breaking or non-joining character has been entered. If that's true, I can envision implementing that without too much difficulty using Keyman.) - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Devanagari keyboard for WindowsXP
At 15:08 +0100 2002.02.23, [EMAIL PROTECTED] wrote: On 02/22/2002 05:26:53 PM Yaap Raaf wrote: I've also been looking at Tavultesoft's Keyman, but there are no readymade keyboards available for the purpose. I don't know how complicated it is to develop one. For simple behaviours, it can be quite easy; e.g. if you just need to assign Devanagari characters to keys on a US keyboard without any additional behaviour considerations, there's a wizard with a visual UI that you can use. If you need, though, you can create fairly sophisticated input methods. How difficult it is depends on how complex the required behaviour is, but the learning curve is **far** less than learning to program in C or to use the Windows DDK. Thanks Peter, I had hoped for your response as you are the Keyman specialist here. A few others responding off list were of the same opinion. So, no aksharmala, but Keyman. Now I have one more question, related to the following: # X-UML-Sequence: 13313 (2000-04-17 20:30:54 GMT) # X-Nutritional-Content-Warning: Message body contains 53% quoted lines # From: Deborah Goldsmith [EMAIL PROTECTED] # To: Unicode List [EMAIL PROTECTED] # Cc: Marco Piovanelli [EMAIL PROTECTED] # Date: Mon, 17 Apr 2000 12:30:53 -0800 (GMT-0800) # Subject: Re: How to create Unicode input methods for MacOS? (long) # [.] # [.] # Our 'uchr' resources are created using an assembler. It's the # only tool we are aware of that can fill in the offsets the 'uchr' # data structure contains. We don't have a custom 'uchr' editing # tool at this point (we would love to have one...). If such a comparison could be made, does the Keyman wizard represent the kind of 'intelligence' of the assembler, or is there a fundamental difference between Windows and Mac systems that makes keyboard editing on Windows less difficult than it is for the Mac in the case of 'uchr'? What's the difference? 
It may be clear I don't know an assembler from a wizard, but I'm just amazed about the 'problem' at Apple and the (seemingly) simple solution for Windows in this regard. Yaap --
RE: Devanagari
David Starner wrote: On Mon, Jan 21, 2002 at 02:20:17PM +0100, Marco Cimarosti wrote: What this means in practice for website developers is: 1) SCSU text can only be edited with a text editor which properly decodes the *whole* file on load and re-encodes it on save. On the other hand, UTF-8 text can also be edited using an encoding-unaware editor, although non-ASCII text is invisible. True for users of Latin-based writing systems. Probably of little comfort to users of Indic or Chinese-based writing systems. I was referring to the task of editing *source* files in HTML, XML, or other computer languages and formats. Most of the time, programmers and webmasters are interested in changing the ASCII part of the file (mark-up, instructions), which is the part most likely to contain bugs to be fixed, or to need changes unrelated to the linguistic content. Of course, the people in charge of writing the *content* need tools that can display the actual characters. And this is true for users of Latin-based writing systems as well: imagine writing in French or German with all occurrences of é, è, ä, ö, ü, etc. transformed into pairs of funny bytes. Better to stick with editors that are aware of your encoding. Of course. Provided that one exists on your platform, and that you are not bound to development tools which don't support it. 2) SCSU text cannot be built by assembling binary pieces coming from external sources. It's not really designed for that. If you're assembling things, just run the output through a UTF-8 to SCSU converter. Which translates to: SCSU is not appropriate for dynamic HTML pages, or for encoding text inside any other kind of application. More generally, SCSU is not appropriate as a text encoding, but just as a compression method for documents in their final form. Ciao. _ Marco
Re: Devanagari on MacOS 9.2 and IE 5.1
It should be fine also on Netscape 6.2.

[EMAIL PROTECTED] wrote:

> I spoke too fast. Upon taking a closer look at the file, the font was not set properly. MacOS 9.2, Indian Language Kit, Mac IE 5.1 and Devanagari MT as font face seem to display UTF-8 encoded Hindi just fine.
> Etienne
>
> Date: Mon, 21 Jan 2002 10:24:16 -0800
> Subject: RE: Devanagari
>
> On this subject, Win2K and IE5+ seem to do a nice job displaying UTF-8-encoded Hindi. On the Mac, the Indian Language Kit provides for OS support and fonts (with MacOS 9.2 and above), but I have not been able to display Hindi (UTF-8 encoded) with Mac's IE 5.1. Am I correct in assuming that the Mac version of IE does not support Hindi without a hack?
> Etienne
Re: Devanagari
At 23:19 -0600 2002-01-20, David Starner wrote: There is no simple encoding scheme that will encode Indic text in Unicode in one byte per character. Raw 32-bit encoding treats all characters equally, doesn't it? :-) -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari
At 00:39 -0500 2002-01-21, Aman Chawla wrote: The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South-Asian scripts... If you want only local users who have ISCII fonts to read them, use ISCII. I wouldn't be able to read such pages, though, because I don't have any ISCII support under Mac OS X. I *do* have Unicode-based Devanagari support, though. The best encoding for information interchange is to use a SINGLE encoding, namely Unicode. For the Web, use UTF-8. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari
* [EMAIL PROTECTED] | | This is why I really wish that SCSU were considered a truly | standard encoding scheme. Even among the Unicode cognoscenti it | is usually accompanied by disclaimers about private agreement only | and not suitable for use on the Internet, where the former claim | is only true because of the self-perpetuating obscurity of SCSU and | the latter seems completely unjustified. Do you know of any published web pages that use SCSU? I think that's probably the place to start. I never add support for encodings I can't find in actual use on the web. (Hint hint. :) Note that IANA *does* consider SCSU a real encoding scheme, since they've defined a tag for it. (Thanks to Markus Scherer, I know, but it does help.) --Lars M.
Re: Devanagari
Aman, What is it you want? To complain about the architecture of Unicode and UTF-8? For good or ill, it isn't going to change. Neither was it a conspiracy to suppress the non-English-speaking peoples of the world. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari
Aman Chawla wrote, With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the south asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem? This is a very good question. How, indeed? There are pros and cons for practically any situation and it seems that you are asking these questions in order to help evaluate those pros and cons. A while back, Benefits of Unicode was a very interesting thread on this list and the results of those discussions formed the basis for some web pages on the subject. Tex Texin made the original page about the benefits, which is on-line at: http://www.geocities.com/i18nguy/UnicodeBenefits.html The page offers links to other pages resulting from the same thread, including a page by Suzanne Topping listing some of the disadvantages of Unicode. One way to encourage members of your user community to embrace the standard might be to offer a translation of some of that material in Unicode Hindi, with the respective authors' permissions, of course, and post it on the web. With newer operating systems being Unicode-based, there are no special plug-ins or filters involved. A sophisticated and elegant writing system like Devanagari justifies having a clever rendering system for display. Unicode and OpenType support for such scripts are expected to be built-in to operating systems. As far as I can tell, under the current OpenType model, OpenType support for Indic scripts under ISCII encoding isn't possible because the features required not only for plain text Devanagari, but also for typographically advanced Devanagari are registered to the various Indic Unicode ranges. 
At least as far as Microsoft OSes are concerned, a feature like the half-letter form won't be applied to anything encoded in the ASCII or ISO 8859-1 range. (Once again, as far as I can tell.) Encourage people to look towards the future. Studying the trend over the past several years as more groups and systems moved towards the Unicode Standard might foster the belief that anyone converting their existing files from a localized encoding into Unicode would be converting those files for the last time. On the other hand, converting from a localized encoding into ISCII would possibly mean that the material would eventually need to be converted into Unicode anyway. I agree with you that efficient and effective data transmission is extremely important and suggest that the most effective way to exchange data is in a standard fashion which is supported worldwide. Best regards, James Kass.
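The bandwidth figure debated in this thread is simple arithmetic: Devanagari letters occupy three bytes each in UTF-8 versus one byte each in ISCII, so a stream of pure Devanagari text carries about a third as many characters per second, which is where the "14 kbps behaves like 4.6 kbps" claim comes from. A quick check (the 14 kbps figure is the one quoted in the thread; real pages also contain one-byte ASCII markup, so the effective ratio is smaller in practice):

```python
text = "\u0915\u0916\u0917" * 100      # 300 Devanagari characters
utf8_bytes = len(text.encode("utf-8"))

ratio = utf8_bytes / len(text)         # bytes per character in UTF-8
print(ratio)                           # 3.0 for pure Devanagari text

modem_kbps = 14                        # modem speed quoted in the thread
print(round(modem_kbps / ratio, 1))    # ~4.7 kbps of ISCII-equivalent throughput
```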
Re: Devanagari
On Sun, 20 Jan 2002 23:57:29 -0500 Aman Chawla [EMAIL PROTECTED] wrote: With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the south asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem? Leave aside that most don't even have modems, or for that matter PCs; they use cybercafes, with 8-10 users sharing a 33 kbps connection. And AFAIK (except for the sites containing Sanskrit manuscripts) no Indian-language website uses even ISCII; text there is in some ad hoc font encoding. So pages can't be indexed by a search engine, and you can't save a page to read or print later. The tools created for making Indian-language web pages (actually, most people just use FrontPage) don't even save pages in ISCII, let alone export to ISCII. Unicode solves a lot of these issues, but existing vendors of Indian-language software would find it inconvenient, as they could no longer lock users in to their products. A lot more could be said, but I would go off topic (I think I am already ;-). Regards, Karunakar
RE: Devanagari
Doug Ewell wrote: Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus an additional byte near the start of the file to set the current window (0x14 = SC4). The problem is what happens if that very byte gets corrupted for any reason... If an octet is erroneously deleted, changed or added in a UTF-8 stream, only a single character is corrupted. If the same thing happens to the window-setting byte of SCSU (or other similarly zany formats), the whole stream turns into garbage. What this means in practice for website developers is: 1) SCSU text can only be edited with a text editor which properly decodes the *whole* file on load and re-encodes it on save. On the other hand, UTF-8 text can also be edited using an encoding-unaware editor, although non-ASCII text is invisible. 2) SCSU text cannot be built by assembling binary pieces coming from external sources. E.g., you cannot take a SCSU-encoded template file and fill in the blanks with customer data coming from a SCSU-encoded database: each time you insert a piece of text coming from the database, you delete the current window information, turning the rest of the file into garbage. On the other hand, UTF-8 allows this, provided that the integrity of each multi-byte sequence is maintained. 3) A SCSU page can only be accepted by browsers and e-mail readers that are able to decode it. On the other hand, UTF-8 also works on old ASCII-based browsers, although non-ASCII text is clearly not properly displayed. _ Marco
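Marco's point about damage locality in UTF-8 can be demonstrated: flip one byte in a UTF-8 stream and only the character(s) touching that byte are lost, while everything before and after decodes intact. A sketch (using Python's replacement-character error handling to stand in for a tolerant decoder):

```python
text = "\u0915\u0916\u0917"                 # three Devanagari letters, 9 UTF-8 bytes
data = bytearray(text.encode("utf-8"))

data[4] = 0x41                              # corrupt one byte inside the middle letter
decoded = bytes(data).decode("utf-8", errors="replace")

print(decoded[0] == "\u0915")               # True: first character survives
print(decoded[-1] == "\u0917")              # True: last character survives
# Only the middle letter is damaged. In a stateful format like SCSU,
# corrupting a window-setting byte can garble everything that follows instead.
```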
SCSU (was: Re: Devanagari)
In a message dated 2002-01-21 1:33:23 Pacific Standard Time, [EMAIL PROTECTED] writes: Do you know of any published web pages that use SCSU? I think that's probably the place to start. I never add support for encodings I can't find in actual use on the web. (Hint hint. :) This becomes a vicious circle, as it is just as reasonable to say that I never create Web pages in encodings that existing browsers can't support. I'm not sure what is the best way to break this circle, except that when I do finally set up a Web site (☺) I might include a parallel SCSU version along with the UTF-8 version, along with a brief description of SCSU. -Doug Ewell Fullerton, California
SCSU (was: Re: Devanagari)
In a message dated 2002-01-21 5:20:55 Pacific Standard Time, [EMAIL PROTECTED] writes: Doug Ewell wrote: Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus an additional byte near the start of the file to set the current window (0x14 = SC4). The problem is what happens if that very byte gets corrupted for any reason... If an octet is erroneously deleted, changed or added from an UTF-8 stream, only a single character would be corrupted. If the same thing happens to the window-setting byte of a SCSU (or other similar zany formats), the whole stream turns into garbage. Yes, SCSU is stateful and the corruption of a single tag, or argument to a tag, could potentially damage large amounts of text. I know this was a big problem in the days of devices and transmission protocols that did little or no error correction. I honestly don't know how big a problem it is today. What this means in practice for website developers is: 1) SCSU text can only be edited with a text editor which properly decodes the *whole* file on load and re-encodes it on save. On the other hand, UTF-8 text can also be edited using an encoding-unaware editor, although non-ASCII text is invisible. I have edited SCSU text using a completely encoding-ignorant MS-DOS editor. Of course I couldn't edit the SCSU control bytes intelligently, but then I can't edit multibyte UTF-8 sequences intelligently with it either. 2) SCSU text cannot be built by assembling binary pieces coming from external sources. E.g., you cannot get a SCSU-encoded template file and fill in the blanks with customer data coming from a SCSU-encoded database: each time you insert a piece of text coming from the database, you delete the current window information, turning into garbage the rest of the file. The current window information is not deleted, it is carried over into any adjoining text that does not redefine it. (This could have its own repercussions, of course.) 
3) A SCSU page can only be accepted by browsers and e-mail readers that are able to decode it. On the other hand, UTF-8 also works on old ASCII-based browsers, although non-ASCII text is clearly not properly displayed. Same as 1). If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences that readers must know how to decode. SCSU does use states, like any compression scheme, so an encoding-ignorant tool will probably have more trouble with SCSU than with UTF-8. But I was not arguing to foist SCSU on an unprepared world, I was suggesting that the world should prepare. ☺ -Doug Ewell Fullerton, California
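Doug's "SCSU == UTF-8 == ASCII" observation is easy to verify for the UTF-8 half; the SCSU half follows from UTS #6, whose default single-byte mode passes the printable ASCII bytes 0x20..0x7F (and common controls) through unchanged. A minimal check of the UTF-8 side:

```python
s = "plain ASCII markup <html>"
# For ASCII-only text, the UTF-8 byte stream is identical to the ASCII one.
print(s.encode("utf-8") == s.encode("ascii"))   # True

t = "hindi: \u0915"
# Add one non-ASCII character and the streams necessarily diverge:
# DEVANAGARI LETTER KA takes 3 bytes in UTF-8, so 2 extra bytes over the char count.
print(len(t.encode("utf-8")) - len(t))          # 2
```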
RE: Devanagari
Aman Here in Bhutan the Internet connection is still much worse than in most places I've visited in India & Nepal (and the cost per minute is several times higher) - believe me even then UTF-8 (or UTF-16) encoded pages do not display noticeably slower than ASCII, ISCII or 8-bit font encoded pages - and I don't need to download any special plug-ins or fonts. - Chris -- Christopher J Fynn Thimphu, Bhutan [EMAIL PROTECTED] [EMAIL PROTECTED] -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Aman Chawla Sent: 21 January 2002 10:57 To: James Kass; Unicode Subject: Re: Devanagari - Original Message - From: James Kass [EMAIL PROTECTED] To: Aman Chawla [EMAIL PROTECTED]; Unicode [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:46 AM Subject: Re: Devanagari 25% may not be 300%, but it isn't insignificant. As you note, if the mark-up were removed from both of those files, the percentage of increase would be slightly higher. But, as connection speeds continue to improve, these differences are becoming almost minuscule. With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the south asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem?
RE: Devanagari
On this subject, Win2K and IE5+ seem to do a nice job displaying UTF-8-encoded Hindi. On the Mac, the Indian Language Kit provides for OS support and fonts (with MacOS 9.2 and above), but I have not been able to display Hindi (UTF-8 encoded) with Mac's IE 5.1. Am I correct in assuming that the Mac version of IE does not support Hindi without a hack?

Etienne

Reply-To: [EMAIL PROTECTED]
From: "Christopher J Fynn" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: "Aman Chawla" [EMAIL PROTECTED]
Subject: RE: Devanagari
Date: Mon, 21 Jan 2002 23:59:38 +0600

Aman Here in Bhutan the Internet connection is still much worse than in most places I've visited in India & Nepal (and the cost per minute is several times higher) - believe me even then UTF-8 (or UTF-16) encoded pages do not display noticeably slower than ASCII, ISCII or 8-bit font encoded pages - and I don't need to download any special plug-ins or fonts. - Chris -- Christopher J Fynn Thimphu, Bhutan [EMAIL PROTECTED] [EMAIL PROTECTED] -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Aman Chawla Sent: 21 January 2002 10:57 To: James Kass; Unicode Subject: Re: Devanagari - Original Message - From: James Kass [EMAIL PROTECTED] To: Aman Chawla [EMAIL PROTECTED]; Unicode [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:46 AM Subject: Re: Devanagari 25% may not be 300%, but it isn't insignificant. As you note, if the mark-up were removed from both of those files, the percentage of increase would be slightly higher. But, as connection speeds continue to improve, these differences are becoming almost minuscule. With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the south asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem? 
Re: Devanagari
Aman Chawla wrote, I would be grateful if I could get opinions on the following: 1. Which encoding/character set is most suitable for using Hindi/Marathi (both of which use Devanagari) on the internet as well as in databases, and why? In your response, please refer to: http://www.iiit.net/ltrc/Publications/iscii_plugin_display.html, particularly the following paragraphs: snip Unicode is the best. It is the world's standard for computer encoding, and, as such, offers the best possibility that text can be exchanged around the globe and cross-platform. The arguments about relative size are true, but in this day and age are considered unimportant. Graphics files are extremely large in comparison with text files in any script, and so are sound files. Devanagari in UTF-8 is three bytes per character. Four-byte UTF-8 sequences are so far used only for Plane One of Unicode and above. 3. With reference to the previous question, can programs that convert the myriad Devanagari encodings in use today to a standard encoding (question 1) be made freely available, and how? Yes, converters exist and are being distributed. Just go to the Google search engine and input character conversion Unicode into the box. Look for ICU and Rosette, to name a few. You might even run across Mark Leisher's download page at: http://crl.nmsu.edu/~mleisher/download.html and see the Perl script for converting the Naidunia Devanagari encoding to UTF-16. 4. Is there any search engine on the internet that maintains an up-to-date index of sites in Devanagari? If not, what can be done to encourage proprietary search engines to support Hindi? Google supposedly has a Hindi language option, but surprise, it's in Roman script! Several emails to them have elicited the response: At the moment we don't support Devanagari... This appears to be because Google is converting UTF-8 strings input to the search-words box into decimal NCRs. When यूनिकोड क्या है is pasted into the Google box, it displays fine. Since the What is Unicode?
pages are popular and have been up for a while, I thought that it would have a good chance of being indexed. But, there were no hits for the resulting search string: &#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337; &#2325;&#2381;&#2351;&#2366; &#2361;&#2376; ...which is not surprising, since the actual page doesn't use NCRs. Best regards, James Kass.
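James's diagnosis above (Google turning UTF-8 input into decimal NCRs) is easy to reproduce. The following is only a sketch of the transformation he describes, not Google's actual code; the function name to_ncrs is ours for illustration:

```python
# Convert non-ASCII characters to decimal numeric character references (NCRs),
# mimicking the transformation James describes Google applying to search input.
def to_ncrs(text: str) -> str:
    return "".join(f"&#{ord(ch)};" if ord(ch) > 0x7F else ch for ch in text)

# "यूनिकोड क्या है" -- the Hindi phrase for "What is Unicode?" from the message
phrase = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921 \u0915\u094d\u092f\u093e \u0939\u0948"
print(to_ncrs(phrase))
# &#2351;&#2370;&#2344;&#2367;&#2325;&#2379;&#2337; &#2325;&#2381;&#2351;&#2366; &#2361;&#2376;
```

The output reproduces exactly the search string James quotes, which is why a page encoded in plain UTF-8 never matches it.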
Re: Devanagari Rupee Symbol
At 11:22 -0500 2002-01-20, Aman Chawla wrote: I am unable to find the Devanagari Rupee sign encoded in Unicode? Is it encoded? If not, why? U+20A8. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Devanagari
At 12:48 AM 1/20/02 -0800, James Kass wrote: The arguments about relative size are true, but in this day and age are considered unimportant. Graphics files are extremely large in comparison with text files of any script and so are sound files. Devanagari UTF-8 is three bytes. The four byte UTF-8 sequences so far are only used for Plane One Unicode and up. If the argument refers to 4-byte sequences for Devanagari, it is not factually 'true', as James points out. More to the point is the following observation: HTML and similar mark-up languages account for an ever-growing percentage of transmitted text - even in e-mail. The fact that UTF-8 economizes on the storage for ASCII characters is a benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and claims a significant fraction of the data. A UTF-8 encoded HTML file will therefore have (percentage-wise) less overhead for Devanagari than claimed. Add to that James' observation on graphics files, many of which accompany even the simplest HTML documents, and you get a percentage difference between the sizes of an English and a Devanagari website (i.e. in its entirety) that's well within the fluctuation of the typical length in characters for expressing the same concept in different languages. In other words, contrary to the claims made by the argument, it is hard to predict that this structure of UTF-8 will have an observable impact on exchanging data - other than psychological, perhaps. In many size-constrained application areas it may pay off to do compression. http://www.unicode.org/unicode/reports/tr6 shows how one can compress Unicode data in Devanagari to a size comparable to that of 8-bit ISCII. However, interchange of this format (SCSU) requires consenting parties. A./
Re: Devanagari
The fact that UTF-8 economizes on the storage for ASCII characters, is a benefit for *all* HTML users, as the HTML syntax is entirely in ASCII and claims a significant fraction of the data. A UTF-8 encoded HTML file, will therefore have (percentage-wise) less overhead for Devanagari as claimed. Add to that James' observation on graphics files, many of which accompany even the simplest HTML documents and you get a percentage difference between the sizes of an English and Devanagari website (i.e. in its entirety) that's well within the fluctuation of the typical length in characters, for expressing the same concept in different languages. The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs - even if you take into account the fluctuation of the typical length in characters, for expressing the same concept in different languages. This is because in some cases one language may express a concept more compactly while in other cases it may not, and on the whole this effect would balance out and can therefore be neglected. Therefore transmission of a Devanagari web page over a network would take thrice as long as that of an English web page using the same images and presenting the same information.
Re: Devanagari
In a message dated 2002-01-20 16:49:17 Pacific Standard Time, [EMAIL PROTECTED] writes: The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs... Therefore transmission of a Devanagari web page over a network would take thrice as long as that of an English web page using the same images and presenting the same information. This conclusion ignores two obvious points, which Asmus already made: (1) The 10 GIFs, each of which may well be larger than the HTML file, take the same amount of space regardless of the encoding of the HTML file. The total number of bytes involved in transmitting a Web page includes everything, HTML and graphics, but the purported factor of 3 applies only to the HTML. (2) The markup in an HTML file, which comprises a significant portion of the file, is all ASCII. So the factor of 3 doesn't even apply to the entire HTML file, only to the plain-text content portion. In addition, text written in Devanagari includes plenty of instances of U+0020 SPACE, plus CR and/or LF, each of which occupies one byte regardless of the encoding. I think before worrying about the performance and storage effect on Web pages due to UTF-8, it might help to do some profiling and see what the actual impact is. -Doug Ewell Fullerton, California
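Doug's byte counts above can be verified directly; a small Python check, using nothing beyond the standard library:

```python
# Each Devanagari code point (U+0900..U+097F) occupies three bytes in UTF-8,
# while ASCII characters -- including SPACE, CR, and LF -- occupy one.
devanagari_ka = "\u0915"                     # क DEVANAGARI LETTER KA
print(len(devanagari_ka.encode("utf-8")))    # 3
print(len(" ".encode("utf-8")))              # 1
print(len("\r\n".encode("utf-8")))           # 2

# A short phrase with a space: 4 letters * 3 bytes + 1 byte for the space
phrase = "\u0915\u0916 \u0915\u0916"         # "कख कख"
print(len(phrase.encode("utf-8")))           # 13
```

So the factor of 3 applies only to the Devanagari letters themselves; every space and line break dilutes it, exactly as Doug says.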
Re: Devanagari
On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote: : The point was that a UTF-8 encoded HTML file for an English web page : carrying say 10 gifs would have a file size one-third that for a Devanagari : web page with the same no. of gifs - even if you take into account the : fluctuation of the typical length in characters, for expressing the same : concept in different languages. This is because in some cases one language : may express a concept more compactly while in other cases it may not, and on : the whole this effect would balance out and can therefore be neglected. : Therefore transmission of a Devanagari web page over a network would take : thrice as long as that of an English web page using the same images and : presenting the same information. And the whole UTF-8 Devanagari page is probably still smaller than even one of the .gif files. -- Christopher Vance
Re: Devanagari
On Sun, Jan 20, 2002 at 07:39:57PM -0500, Aman Chawla wrote: The point was that a UTF-8 encoded HTML file for an English web page carrying say 10 gifs would have a file size one-third that for a Devanagari web page with the same no. of gifs The point is that the text for a short webpage is 10k for English and 30k for Devanagari, the HTML will be another 10k for English and another 10k for Devanagari, and the graphics will be another 30k for English and another 30k for Devanagari, meaning that the total will be 50k for English and 70k for Devanagari - a 40% markup, not 200%. Adding a 150k graphic would make it 200k for English and 220k for Devanagari, making it a 10% markup. -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
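David's arithmetic can be restated as a tiny calculation; the function name and the factor-of-3 parameter are illustrative only:

```python
# Totals and percentage overhead for David's worked page-size example:
# the plain text triples in size under UTF-8, HTML markup and graphics do not.
def overhead(text_kb, html_kb, graphics_kb, factor=3):
    english = text_kb + html_kb + graphics_kb
    devanagari = text_kb * factor + html_kb + graphics_kb
    return english, devanagari, (devanagari - english) * 100 / english

print(overhead(10, 10, 30))    # (50, 70, 40.0)  -- 40% overhead, not 200%
print(overhead(10, 10, 180))   # (200, 220, 10.0) -- with the extra 150k graphic
```

The overhead shrinks toward zero as the fixed (encoding-independent) parts of the page grow, which is the whole of David's point.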
Re: Devanagari
Doug Ewell wrote, I think before worrying about the performance and storage effect on Web pages due to UTF-8, it might help to do some profiling and see what the actual impact is. The What is Unicode? pages offer a quick study. 14808 bytes (English) 15218 bytes (Hindi) 10808 bytes (Danish) 11281 bytes (French) 9682 bytes (Chinese Trad.) (The English page includes links to all the other scripts, but the individual script pages only link back to the English page. So, the English page is a bit larger than the other pages for this reason, not a fair test if we only count the English and Hindi pages.) The Unicode logo gif at the top left corner of each of these pages takes bytes. A screen shot of the beginning of the Hindi page takes 37569 bytes as a gif, the small portion cropped and attached takes 4939 bytes. The What is Unicode? pages are at: http://www.unicode.org/unicode/standard/WhatIsUnicode.html Best regards, James Kass. hindiwhatis.gif Description: GIF image
Re: Devanagari
At 10:44 PM 1/20/2002 -0500, you wrote: Taking the extra links into account the sizes are: English: 10.4 Kb Devanagari: 15.0 Kb Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII. Yes, but that is this page only. Are you suggesting that all pages will vary by that factor? Of course not. Please consider whether the space *in practice* is a limiting factor. It seems that folks on the list feel it is not: not for bandwidth-limited applications, and not for disk-space-limited applications. The amount of space devoted to plain text of any language on a typical web page is microscopic compared to the markup, images, sounds, and other files also associated with the web page. Are you suggesting that UTF-8 ought to have been optimized for Devanagari text? Barry Caplan www.i18n.com -- coming soon...
Re: Devanagari
- Original Message - From: James Kass [EMAIL PROTECTED] To: Aman Chawla [EMAIL PROTECTED]; Unicode [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:46 AM Subject: Re: Devanagari 25% may not be 300%, but it isn't insignificant. As you note, if the mark-up were removed from both of those files, the percentage of increase would be slightly higher. But, as connection speeds continue to improve, these differences are becoming almost minuscule. With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the South Asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem?
Re: Devanagari
On Sun, Jan 20, 2002 at 10:44:00PM -0500, Aman Chawla wrote: For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII. Uncompressed, yes. It shouldn't be nearly as bad compressed - gzip, zip, bzip2, or whatever your favorite tool is. You could also use UTF-16 or SCSU, which will get it down to about 2 or about 1, respectively. What's your point in continuing this? Most of the people on this list already know how UTF-8 can expand the size of non-English text. There's nothing we can do about it. Even if you had brought it up when UTF-8 was being designed, there's not much anyone could have done about it. There is no simple encoding scheme that will encode Indic text in Unicode in one byte per character. It's the pigeonhole principle in action - if you need to encode 150,000 characters, you can't encode each one in one or two bytes, and while you can write encodings that approach that for normal text, they aren't going to be simple or pretty. -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
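David's remark that compression largely removes the UTF-8 size penalty can be sketched with the standard zlib module. The sample text and repetition count below are arbitrary, and real documents compress far less dramatically than this repetitive example:

```python
import zlib

# Compare raw UTF-8 size with its zlib-compressed size for repetitive
# Devanagari text. Each of the 6 Devanagari letters costs 3 bytes raw,
# plus 1 byte for the space: 19 bytes per repetition.
text = "\u0928\u092e\u0938\u094d\u0924\u0947 " * 200   # "नमस्ते " repeated
raw = text.encode("utf-8")
packed = zlib.compress(raw, 9)
print(len(raw))                  # 3800
print(len(packed) < len(raw))    # True -- compression erases most of the penalty
```

With gzip, bzip2, SCSU, or similar, the stored-archive factor Aman worries about shrinks well below 3, as David and Geoffrey both note.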
Re: Devanagari
- Original Message - From: "David Starner" [EMAIL PROTECTED] To: "Aman Chawla" [EMAIL PROTECTED] Cc: "James Kass" [EMAIL PROTECTED]; "Unicode" [EMAIL PROTECTED] Sent: Monday, January 21, 2002 12:19 AM Subject: Re: Devanagari What's your point in continuing this? Most of the people on this list already know how UTF-8 can expand the size of non-English text. The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South Asian scripts...
Re: Devanagari
On Sun, 20 Jan 2002, Aman Chawla wrote: Taking the extra links into account the sizes are: English: 10.4 Kb Devanagari: 15.0 Kb Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII. Well, a trivial adjustment is to use UTF-16 to store your documents if you know they are going to be predominantly Devanagari. Or if you have so much text that the number of extra disks is going to be painful, use SCSU to bring it very close to the ISCII ratio. Of course I would note that you can store millions of pages of plain text on a single hard disk these days. If you're going to be storing so many hundreds of millions of pages of plain text that the number of extra disks is a bother, I am amazed that none of it might be outside the ISCII repertoire. And this huge document archive has no graphics component to go with it... But the real reason for publishing the data in Unicode on the web is so people not using a machine specially configured for ISCII will still be able to read and process the data. [then later wrote:] With regards to South Asia, where the most widely used modems are approx. 14 kbps, maybe some 36 kbps and rarely 56 kbps, where broadband/DSL is mostly unheard of, efficiency in data transmission is of paramount importance... how can we convince the South Asian user to create websites in an encoding that would make his client's 14 kbps modem as effective (rather, ineffective) as a 4.6 kbps modem? Can you read 500 characters per second? So long as they are receiving only plain text, even this dawdling speed is not going to impact them. People wanting to efficiently transfer data will use a compression program. Geoffrey
Re: Devanagari
In a message dated 2002-01-20 20:49:00 Pacific Standard Time, [EMAIL PROTECTED] writes: Usually, when someone offers a large body of plain text in any script, files are compressed in one way or another in order to speed up downloads. This is why I really wish that SCSU were considered a truly standard encoding scheme. Even among the Unicode cognoscenti it is usually accompanied by disclaimers about private agreement only and not suitable for use on the Internet, where the former claim is only true because of the self-perpetuating obscurity of SCSU and the latter seems completely unjustified. Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus an additional byte near the start of the file to set the current window (0x14 = SC4). -Doug Ewell Fullerton, California
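Doug's one-byte-per-character claim can be illustrated with a toy encoder. This handles only the single case he describes (one SC4 tag byte, then window-4 single-byte mode); it is not a conforming SCSU implementation, and the names are ours:

```python
# Toy sketch of Doug's SCSU point: the SC4 tag byte (0x14) selects dynamic
# window 4, predefined at U+0900 (the Devanagari block). After that, each
# Devanagari character maps to one byte (offset from the window base, plus
# 0x80), and ASCII bytes pass through unchanged.
SC4 = 0x14
WINDOW4_BASE = 0x0900

def scsu_devanagari(text: str) -> bytes:
    out = [SC4]
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in (0x09, 0x0A, 0x0D):
            out.append(cp)                        # ASCII passes through
        elif WINDOW4_BASE <= cp < WINDOW4_BASE + 0x80:
            out.append(cp - WINDOW4_BASE + 0x80)  # one byte per character
        else:
            raise ValueError(f"outside this toy encoder's range: U+{cp:04X}")
    return bytes(out)

encoded = scsu_devanagari("\u092f\u0942\u0928\u093f\u0915\u094b\u0921")  # यूनिकोड
print(len(encoded))   # 8 -- one tag byte plus seven single-byte characters
```

Seven Devanagari characters cost eight bytes total, versus twenty-one in UTF-8, matching Doug's "1 byte per character plus one byte near the start" description.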
Re: Devanagari
In a message dated 2002-01-20 21:49:02 Pacific Standard Time, [EMAIL PROTECTED] writes: The issue was originally brought up to gather opinion from members of this list as to whether UTF-8 or ISCII should be used for creating Devanagari web pages. The point is not to criticise Unicode but to gather opinions of informed persons (list members) and determine what is the best encoding for information interchange in South-Asian scripts... It seems that the only point against Unicode compared to ISCII is the resulting document size in bytes, and this one point is being given 100% focus in the comparison. If the actual question is, What is the most efficient encoding for Devanagari text, in terms of bytes, using only the most commonly encountered encoding schemes and no external compression? then of course you will have loaded the question in favor of ISCII. But when you consider that more browsers today around the world (not just in India) are equipped to handle Unicode than ISCII, and that Unicode allows not only the encoding of ASCII and Devanagari but the full complement of Indic scripts (Oriya, Gujarati, Tamil...) as well as any other script on the planet that you could realistically want to encode, you will probably have to rethink the cost/benefit tradeoff of Unicode. -Doug Ewell Fullerton, California
Re: Devanagari
On Mon, Jan 21, 2002 at 12:57:39AM -0500, [EMAIL PROTECTED] wrote: This is why I really wish that SCSU were considered a truly standard encoding scheme. Even among the Unicode cognoscenti it is usually accompanied by disclaimers about private agreement only and not suitable for use on the Internet, where the former claim is only true because of the self-perpetuating obscurity of SCSU and the latter seems completely unjustified. Does Mozilla support it? If someone's willing to spend a little time, adding it to Mozilla is one way to make it more generally useable. And maybe then IE will get nudged into playing a little catchup . . . -- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
RE: Devanagari question
From: Rick McGowan [mailto:[EMAIL PROTECTED]] Mike Ayers wrote: The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Yes, the computer should support the user, but... The invention of new characters to serve multitudes is OK, and international standards will probably continue to support that. But I don't think it's reasonable or appropriate to keep inventing new characters willy-nilly for individuals (as reported), and then expect them to be added to an international standard. That's silly. The onus is not on international standards to support the whimsical production of novel, rarely-used, or nonce characters of the type reported to be generated. That is not established. The degree to which computer or user will dictate what will and will not be permitted has yet to be decided. Certainly, I already have full support for any words that I care to make up - I need merely spell them. Since hanzi are words-as-characters, the issue is much more cloudy, since the position of the Unicode specification (due to the encoding method used) is that hanzi are characters-only. This may not be the final solution. In any case, I still have never seen actual documentary evidence that would prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a hat. People just keep saying that to scare everyone. Sounds like an urban myth to me. Good point. I will go seek a definitive answer. Not much point in discussing this if it doesn't really happen. /|/|ike
Re: Devanagari question
Mark Davis wrote: The Unicode Standard does define the rendering of such combinations, which is in the absence of any other information to stack outwards. A dumb implementation would simply move the accent outwards if there was one in the same position. This will not necessarily produce an optimal positioning, but should be readable. Note that it also should increase the line spacing. Note also that the renderer should notice that event even when there are interleaved irrelevant (zero-width) characters. And we are using a dumb implementation. Anyway, my point was not about this, which are, as you say, the basics of the dumbest renderer. No, I was thinking about the implications of mixing Nagari consonants with kana diacritics (or the contrary); or circling (U+20DD) around Indian conjuncts, or else around superscript digits; or Tibetan subjoined below Latin letters (how do they attach?); or Jamos followed by a virama or a Telugu length mark. Etc. My point was that it is *not* a good idea to render an out-of-context Telugu length mark (U+0C55), when it follows for example a Latin vowel, as a macron, even if this is the "logical" behaviour. Such code will be, IMHO, just waste. If it takes megabytes of code to do [that] there is probably something else wrong. I do not count a dumb implementation as "decent". And yes, I was overemphasizing with "megabytes". The OT support in FreeType, which does only a small part of this task, is only 315 Kbytes of C code. So I expect the not-so-dumb renderer based on it to be around 0.5 megabyte. Which does not take into account the code embedded in the OT fonts themselves. As a result, yes, please remove the "s". Antoine
RE: Devanagari question
From: D.V. Henkel-Wallace [mailto:[EMAIL PROTECTED]] At 06:30 2000-11-14 -0800, Marco Cimarosti wrote: But my point was: not even Mr. Ethnologue himself knows exactly *which* combinations are meaningful, in all orthographic systems. And, clearly, no one can figure out which combinations may become meaningful in the *future* -- e.g. when a previously unwritten language gets its orthography, or when the spelling of an already written language gets changed. Sadly, it seems unlikely that any future change or adoption of orthography will use characters not already supported by the then-major computer systems. In fact the trend seems to be the other way, viz. Spain's changing of its collation rules. I do not think that this is a trend. The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Only during the current frenzy of computerization is the reverse permitted - this will pass. For a minority language (which all remaining unwritten languages are) the pressure will be strong to use existing combinations (since they won't constitute a large enough community for people to write special rendering support). That depends on how you look at it. From what I understand (which I freely admit I have learned only from this list), Indic languages tend to be supported in toto, and therefore even the currently unwritten ones will belong to a highly non-minority language family. $.02, /|/|ike
RE: Devanagari question
Mike Ayers wrote: The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Yes, the computer should support the user, but... The invention of new characters to serve multitudes is OK, and international standards will probably continue to support that. But I don't think it's reasonable or appropriate to keep inventing new characters willy-nilly for individuals (as reported), and then expect them to be added to an international standard. That's silly. The onus is not on international standards to support the whimsical production of novel, rarely-used, or nonce characters of the type reported to be generated. In any case, I still have never seen actual documentary evidence that would prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a hat. People just keep saying that to scare everyone. Sounds like an urban myth to me. Rick
RE: Devanagari question
On Tue, 14 Nov 2000, Rick McGowan wrote: Mike Ayers wrote: The last I knew, computer-savvy Taiwan and Hong Kong were continuing to invent new characters. In the end, the onus is on the computer to support the user. Yes, the computer should support the user, but... The invention of new characters to serve multitudes is OK, and international standards will probably continue to support that. But I don't think it's reasonable or appropriate to keep inventing new characters willy-nilly for individuals (as reported), and then expect them to be added to an international standard. That's silly. The onus is not on international standards to support the whimsical production of novel, rarely-used, or nonce characters of the type reported to be generated. In any case, I still have never seen actual documentary evidence that would prove to me that in fact Taiwan and Hong Kong *ARE* creating new characters at the drop of a hat. People just keep saying that to scare everyone. Sounds like an urban myth to me. I think there is some confusion between "new characters" in the sense that they were never available in any standard, but which are taken from pre-existing print sources, and now people would like to properly add them; versus "new characters" that were made up "yesterday" for frivolous reasons. Thomas Chan [EMAIL PROTECTED]
RE: Devanagari question
Antoine Leca wrote: My understanding is that there are a number of similar cases, which are not officially prohibited (AFAIK), but do not carry any sense. For example, how about digits followed by accents (as combining marks)? Or the kana voicing/voiceless combining marks, when they follow anything other than hiragana or katakana? I think that the original idea behind having combining marks in Unicode was that *any* combination of base + diacritic should be permitted, and be handled decently by rendering engines. The reason for this is that there are thousands of languages in the world, and their orthographies may require an uncommon usage of diacritics. E.g., talking about katakana, I think that the orthography of the Ainu language requires some combinations of syllable + voiced sign (or was it syllable + semivoiced sign) that do not exist in Japanese. If font designers and d. engine implementers insist on the idea that an "accented letter" may be rendered only if an ad-hoc glyph has been anticipated in the font, many minority languages will never have a chance of being supported at a reasonable cost. This is not to say that fonts should *not* have precomposed glyphs. Precomposed glyphs are useful in *some* cases, for providing a *better*, *nicer* rendering for *some* troublesome combinations. Less common combinations, used in less known languages, may get along with a less-than-perfect rendering -- but *no* rendering at all is not acceptable, IMHO! Sorry for stating the obvious, but I see that actual implementations often have an attitude towards precomposed glyphs that I don't see a reason for. _ Marco __ La mia e-mail è ora: My e-mail is now: marco.cimarostiªeurope.com (Cambiare "ª" in "@") (Change "ª" to "@")
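Marco's distinction between combinations that have precomposed forms and those that do not is visible in Unicode normalization itself; a short illustration using the standard unicodedata module:

```python
import unicodedata as ud

# NFC maps common base + diacritic pairs to precomposed characters, but
# uncommon pairs (a digit with an acute accent, say) have no precomposed
# form and stay as two code points -- the renderer must stack them itself,
# which is exactly the fallback case Marco and Antoine are debating.
common = ud.normalize("NFC", "e\u0301")    # e + COMBINING ACUTE ACCENT
uncommon = ud.normalize("NFC", "4\u0301")  # 4 + COMBINING ACUTE ACCENT

print(len(common))     # 1 -- the precomposed U+00E9 (é) exists
print(len(uncommon))   # 2 -- no precomposed form; stays decomposed
```

A font can anticipate a glyph for the first case; for the second, only generic mark positioning (or overstriking) can save it.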
Re: Devanagari question
The Unicode Standard does define the rendering of such combinations, which is in the absence of any other information to stack outwards. Implementations that can't do that will either overstrike, or use some other fallback rendering. A sophisticated rendering will use positioning such as control-point matching to get optimal positioning. A dumb implementation would simply move the accent outwards if there was one in the same position. This will not necessarily produce an optimal positioning, but should be readable. It may take a non-trivial amount of code to do the former (especially if it means adding control-point hinting, as in TrueType). If it takes megabytes of code to do the latter, there is probably something else wrong. Mark - Original Message - From: "Antoine Leca" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Monday, November 13, 2000 10:11 Subject: Re: Devanagari question Marco Cimarosti wrote: Antoine Leca wrote: My understanding is that there are a number of similar cases, which are not officially prohibited (AFAIK), but does not carry any sense. I think that the original idea behind having combining marks in Unicode was that *any* combination of base + diacritic should be permitted, The fact that it is permitted (as I said, they "are not prohibited") does not per se give them any sense... This was my point, but I was not clear enough. and be handled decently by rendering engines. The question here is the meaning of "decently".
I beg your pardon, but as the programmer of a rendering engine, I cannot agree that I should spend hours and days, and furthermore add megabytes of code, to render "decently" combinations like digits + accents (by decently, I mean I should check if the glyph for the digit has an ascender above x-height, or is of narrower width, and then adjust the position of the diacritic accordingly; similarly, adjusting the descender position of the Nagari virama according to the descender depth of a preceding "g" or "j" or "y".) On the contrary, I believe that when a combination is not expected, the renderer should have a very basic and straightforward behaviour, and just "print" the default glyphs in order, with overstriking when the second glyph is a combining mark. Doing something more complex, in addition to being IMHO a complete loss of time for both the programmer and the users (who must load unused code), is also likely to give some users the idea that such weird combinations are handled this ("clever") way everywhere, thus leading to chaos when the data is brought elsewhere. If font designers and d. engines implementers insist in the idea that an "accented letter" may be rendered only if an ad-hoc glyph has been anticipated in the font, many minority languages will never have a chance of being supported at a reasonable cost. I never said (nor, I hope, implied) such an idea. Now, insisting that any renderer should align properly any diacritic on the top (or bottom) middle of the I, M and W glyphs will have the net result that nobody will ever be able to create any renderer... Less common combinations, used in less known languages, may get along with a less-than-perfect rendering -- but *no* rendering at all is not acceptable, Where did anyone state such an idea? Antoine
RE: Devanagari Consonant RA Rule R2
On Wed, 8 Nov 2000, Apurva Joshi wrote: The RA[sup] is seen applied to the independent vowel Vocalic R (U+090B) in printed samples in Sanskrit. There are at least the following words that contain the above: NaiRiTa (the name of a demon) = 0928 090B Ra[sup] 0924 NaiRiTi (the goddess Durga, slayer of demons) = 0928 090B Ra[sup] 0924 0940 NaiRiTYa (south-west) = 0928 090B Ra[sup] 0924 094D 092F The Devanagari shaping engine in Uniscribe currently recognises a 0930 094D preceding only consonants, to be duly reordered to the end of the syllable and replaced with Ra[sup]. Whether this be extended to independent vowels had figured in internal discussions when the shaping engine was being planned. To the best of my knowledge, extending this to be applicable to Vocalic R would be a special case, because Ra[sup] is not seen to be applied to any other Indic vowel in words that are native to Indic languages. Would be glad to hear from any expert on this list if there are phonemes/sounds in any language which, when transliterated into Devanagari, would require the Ra[sup] to be applied to an independent vowel, e.g. vowel E Ra[sup] etc. Thanks, -apurva -Original Message- From: Eric Mader/Cupertino/IBM [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 08, 2000 10:24 AM To: Unicode List Subject: Devanagari Consonant RA Rule R2 Hello, In the Devanagari section of the standard, rule R2, on page 217 of the version 3.0 standard, states, "If the dead consonant RA[d] precedes either a consonant *or an independent vowel,* then it is replaced by the superscript nonspacing mark RA[sup]..." I've never seen a RA[sup] applied to an independent vowel, and none of the software I can find that renders Devanagari does this; they all render a dead RA followed by the vowel. Is the rule in error, or is it written to cover some obscure case that most software doesn't bother with? Eric Mader Wednesday, November 8, 2000 First, I'm not an expert in Sanskrit but have done some work with Devanagari.
I think at figure 9-3 (4) on page 214 and at R2 on page 217, Unicode 3.0 overstates and misstates the situation a bit. What is being described is, I believe, a rendering issue, not an encoding issue. Instead of involving an independent vowel, it involves the r consonant, U+0930, immediately followed by the R vowel sign (matra), U+0943, which happens to get rendered as the independent vowel, U+090B, with the superscript R, reph, above it - with no halant between the consonant and the vowel sign. On page 24 of Hester Lambert's Introduction to Devanagari: "The vowel sign of [U+090B] is not written with [U+0930, 094D]. The character representing [U+0930, 094D] with [U+090B] is written with the superscribed stroke used to represent [U+0930, 094D] when it is to be realized before another consonant character without an intervening vowel [i.e. reph]. This stroke is placed over the vowel character [U+090B], as in [U+0928, 093F, 090B, reph, 0924, 093F] nirrti." The order of filing 'nirri' (dot under second r) in Monier Williams' Sanskrit-English dictionary (page 554, column 2) tends to confirm this interpretation: after nirUha it has nirri, nirrich and nirrij (with a dot below the second r), followed by nire. It is possible that this peculiar rendering practice would extend to RA followed by U+0944, 0962 or 0963, but they seem to me too unlikely to dwell on. I suppose (by analogy to having two ways to encode many letters with diacritics) Unicode could allow two ways to encode what looks like "R vowel with reph"; at present it describes the one with a halant but is silent about the display when the r consonant is immediately followed by the r matra, U+0930, 0943. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
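The R2 discussion above reduces to a small classification question. The following Python sketch encodes the consonant-only reading that Apurva says Uniscribe implements; the function name and the simplified consonant-range check are illustrative only, not shaping-engine code:

```python
# Sketch of rule R2 as implemented by Uniscribe per Apurva's message: a dead
# RA (U+0930 + virama U+094D) becomes a reph only when the next character is
# a consonant. Whether it should also fire before independent vowels such as
# vocalic R (U+090B) is exactly the open question in this thread.
RA, VIRAMA = "\u0930", "\u094d"

def is_consonant(ch: str) -> bool:
    # Core Devanagari consonant range KA..HA; a real engine checks more cases.
    return "\u0915" <= ch <= "\u0939"

def forms_reph(syllable: str) -> bool:
    return (syllable.startswith(RA + VIRAMA)
            and len(syllable) > 2
            and is_consonant(syllable[2]))

print(forms_reph("\u0930\u094d\u0924"))   # RA + virama + TA -> True
print(forms_reph("\u0930\u094d\u090b"))   # RA + virama + vocalic R -> False
```

Extending R2 to independent vowels, as the standard's wording suggests, would amount to widening that single range test, which is why the thread treats it as a deliberate implementation decision rather than a technical obstacle.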