RE: Pupil's question about Burmese
FWIW: The OS really likes Unicode, so most of the text input controls and the like are really Unicode underneath. ANSI apps (including non-Unicode web pages) get the data back from those controls in ANSI, so you can lose data that it looked like you entered. As mentioned, the solution is to fix the app to use Unicode, especially for a language like this. In these cases, machines will be fairly inconsistent even if they did support some code page, but Unicode works almost everywhere. Usually it's not difficult for a web page to switch to UTF-8. If it's a form, it's even possible that overriding the encoding on your end might get the data posted back in UTF-8 and succeed (if you're really lucky), but the real fix is to have the web server serve Unicode. -Shawn http://blogs.msdn.com/shawnste

From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of Peter Constable [peter...@microsoft.com] Sent: Tuesday, November 09, 2010 10:42 PM To: James Lin; Ed Cc: Unicode Mailing List Subject: RE: Pupil's question about Burmese

A non-Unicode web page is like a non-Unicode app. Web pages, and apps, should use Unicode. Peter

-Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of James Lin Sent: Tuesday, November 09, 2010 11:24 AM To: Ed Cc: Unicode Mailing List Subject: RE: Pupil's question about Burmese

Oh, don't get me wrong. Having Unicode is like wearing a crown and being a king; it's the best thing out there. What I am referring to is this: if a web page, or any application, does not support Unicode, then even when running Windows 7 with an English locale (even though the OS natively supports UTF-16), it is not possible to copy/paste directly without the correct supported locale; otherwise you may damage the bytes of the characters, which shows up as corruption.
Even though most modern APIs are, one hopes, written with Unicode calls, not all (legacy) applications are written in Unicode, so conversion is still necessary even to handle non-ASCII data. Let me know if I am still missing something here.

-Original Message- From: Ed [mailto:ed.tra...@gmail.com] Sent: Tuesday, November 09, 2010 11:02 AM To: James Lin Cc: Unicode Mailing List Subject: Re: Pupil's question about Burmese

Yes, displaying is fine, but the original question is about copying and pasting; without the correct locale settings, you can't copy/paste without corrupting the bytes of the characters. Copy/paste is generally handled by the OS itself, not the application. Even if you have a Unicode-supporting application, you can display text, but you can't handle non-ASCII characters.

Why not? Modern Win32 OSes use UTF-16. Presumably most modern applications are written using calls to the modern API, which should seamlessly support copy-and-paste of Unicode text, regardless of script or language -- so long as the script or language is supported at the level of displaying the text correctly and you have a font that works for that script. Actually, even if the text displays imperfectly (i.e., one sees square boxes when lacking a proper font, or even if the OpenType GPOS and GSUB rules are not correct for a Complex Text Layout script like Burmese), copy-and-paste of the raw Unicode text should still work correctly. Is this not the case?
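The data-loss scenario described in this thread is easy to reproduce. Below is a minimal Python sketch (illustrative only; the Burmese sample string is my own choice): a legacy single-byte code page such as windows-1252 has no code points for Myanmar characters, so a non-Unicode round trip must drop or corrupt them, while UTF-8 round-trips the same text losslessly.

```python
text = "\u1019\u103C\u1014\u103A\u1019\u102C"  # Burmese-script sample ("Myanmar")

# A legacy single-byte code page has no slots for Myanmar characters,
# so a conversion to "ANSI" cannot preserve the text:
try:
    text.encode("cp1252")
    survived = True
except UnicodeEncodeError:
    survived = False
print("survived cp1252:", survived)  # survived cp1252: False

# UTF-8 round-trips the same text without any loss:
assert text.encode("utf-8").decode("utf-8") == text
print("UTF-8 round trip OK")
```

This is only the encoding-level view, of course; the Windows clipboard and edit controls add their own ANSI/Unicode conversion layer on top, which is where the corruption described above actually happens.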
Re: Pupil's question about Burmese
On 11/10/2010 02:17 PM, Shawn Steele wrote: As mentioned, the solution is to fix the app to use Unicode. Especially for a language like this. In these cases, machines will be fairly inconsistent even if they did support some code page, but Unicode works most everywhere.

AFAIK there has never been a standard code page for Myanmar text; Unicode was the first time storage of Burmese text was standardised for computers. There are several different legacy font families in use for Myanmar, each with its own slightly different mapping to Latin code points. The font in question has a Unicode cmap table, but the map is from Latin code points to glyphs, not from Myanmar code points to glyphs. There are also several fonts which map incorrectly from the Myanmar Unicode block, using the Mon, Shan and Karen code points for glyph variants so that the font can avoid having OpenType/Graphite/AAT rules. If anyone is having trouble installing genuine Myanmar Unicode fonts, I have some instructions at http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/gettingStarted.php Keith
Are Latin and Cyrillic essentially the same script?
As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf = L2/10-356, there exists a Latin letter which resembles the Cyrillic soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif variant of the alphabet, which was used for several languages in the former Soviet Union (e.g. Tatar), and was developed in parallel to the alphabet nowadays in use for Turkish and Azerbaijani, see: http://en.wikipedia.org/wiki/Janalif . In fact, it was proposed on this basis, being the only Jaꞑalif letter still missing, since the ꞑ (occurring in the alphabet name itself) was introduced with Unicode 6.0. The letter is not a soft sign; it is the exact Tatar equivalent of the Turkish dotless i, and thus has a use similar to that of the Cyrillic yeru Ы/ы (U+042B/U+044B). In this function, it is part of the adaptation of the Latin alphabet for many non-Russian languages of the Soviet Union in the 1920s, see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941, http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 . (A proposal regarding this subject is expected for 2011.) Thus, it shares with the Cyrillic soft sign its form and partly the geographical area of its use, but in no case its meaning. Something similar can be said e.g. of P/p (U+0050/U+0070, Latin letter P) and Р/р (U+0420/U+0440, Cyrillic letter ER). According to the pre-preliminary minutes of UTC #125 (L2/10-415), the UTC has not accepted the Latin Ь/ь. It is an established practice for the European alphabetic scripts to encode a new letter only if it has a different shape (in at least one of the capital and small forms) with respect to all already encoded letters of the same script. The Y/y is well known to denote completely different pronunciations, used as consonant as well as vocal, even within the same language.
Thus, if somebody unearths a Latin letter E/e in some obscure minority language which has no E-like vocal, used to denote an M-like sound and in fact collated after the M in the local alphabet, this will probably not lead to a new encoding. But Latin and Cyrillic are different scripts (the question in the subject of this mail is rhetorical, of course). Admittedly, there also is a precedent for using Cyrillic letters in Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone letters in Zhuang. However, the orthography using them was short-lived, being superseded by another Latin orthography which uses genuine Latin letters as tone marks (J/j and X/x, in this case). On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь did not lose the Ь/ь through an improvement of the orthography; they were completely deprecated by an ukase of Stalin. Thus, they continue to be the Latin alphabets of the respective languages. Whether a revival is formally requested or not, they are regarded as valid by the members of the cultural group (even if only to access their cultural inheritance). In particular, it cannot be excluded that people will want to create Latin domain names or e-mail addresses without being accused of script mixing. Taking this into account, not to mention the technical problems regarding collation etc. and the typographical issues when it comes to subtle differences between Latin and Cyrillic in high-quality typography, it is really hard to understand why the UTC refuses to encode the Latin Ь/ь. A quick glance at the Юшманов table mentioned above proves that there is absolutely no request to duplicate the whole Cyrillic alphabet in Latin, as someone may have feared. - Karl Pentzlin
Re: Are Latin and Cyrillic essentially the same script?
On 2010-11-10 10:08, I wrote: As shown in N3916 ... Please read vowel instead of vocal throughout that mail. Sorry.
Combining Triple Diacritics (N3915) not accepted by UTC #125
From the Pre-Preliminary minutes of UTC #125 (L2/10-416): C.4 Preliminary Proposal to enable the use of Combining Triple Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353] - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf [125-A13] ... UTC does not believe that either solution A or solution B represents an appropriate encoding solution for the text representation problem shown in this document. Appropriate technology involving markup should be applied to the problem of representation of text at this level.

This will not happen. Linguists will continue to use their PUA code points (or even their 8-bit fonts), which serve these characters perfectly (albeit using precomposed glyphs for the combinations in use).

This is not plain text. It *is*, at least for the applications in dialectology where groups of three characters linked by one of the proposed triple diacritics have a well-defined and documented meaning. This is also proven by the fact that the existing PUA characters fulfil the needs of the relevant academic communities perfectly, except that they are not interchangeable without special fonts containing these PUA characters (a shortcoming which would be overcome once these characters are contained in Unicode).

Processes such as line-breaking do not know about these, or the double diacritics, and this creates problems for such processes. Problems are there to be solved, and they are solvable. E.g., simply state that no line break may occur within the span of a diacritic covering three letters. Latin *is* a complex script, anyway. - Karl Pentzlin
Re: Combining Triple Diacritics (N3915) not accepted by UTC #125
On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote: From the Pre-Preliminary minutes of UTC #125 (L2/10-416): [...] This will not happen. Linguists will continue to use their PUA code points (or even their 8-bit fonts), which serve these characters perfectly (albeit using precomposed glyphs for the used combinations).

Advanced typesetting engines like TeX (which was invented 30 years ago, mind you) already support wide accents that span multiple characters: $\widehat{abcd}$ $\widetilde{abcd}$ \bye Even math formulas in recent MS Office versions can do that (well, it is math because, apparently, only mathematicians cared about it, but I don't see why it should not work for linguists too). Regards, Khaled -- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer
RE: Combining Triple Diacritics (N3915) not accepted by UTC #125
You can put diacritics over an arbitrarily large base by using an accent object in a math zone. For example, in my email editor (Outlook), I type alt+= to insert a math zone and then (a+b)\tilde followed by two spaces to get a wide tilde over a+b. Evidently linguistic analysis is yet another field in which mathematical typography is useful. Murray
Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Here's a peculiar question. Is there a standard term to describe text that is in some subset CCS of another CCS but, strictly speaking, is only really in the subset CCS because it doesn't have any characters in it other than those represented in the smaller CCS? (The fact that I struggled to phrase this question in a way that made my meaning clear -- and failed -- is precisely my dilemma.) Text that has in it only characters that are in the ASCII character encoding is also in the ISO 8859-1 character encoding and the UTF-8 character encoding form of the Unicode coded character set, right? I often need to talk and write about text that has such multiple personalities, but I invariably struggle to make my point clearly and succinctly. I wind up describing the notion of it in awkwardly verbose detail. So I'm left wondering if the character encoding cognoscenti have a special utilitarian word for this, maybe one borrowed from mathematics (set theory). Jim Monty
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
If you want to get that point across to a general audience, you could use a more colloquial term, albeit one that itself derives from mathematics. Text that can be completely expressed in ASCII fits into something (ASCII) that works as a lowest common denominator of a large number of character sets. You could call it lowest common denominator text. Since ASCII is the only set that exhibits such a lowest common denominator relationship with enough other sets to make it interesting, and since that relation is so well known, it's usually enough to just refer to it by name (ASCII) without needing a general term - except perhaps for general audiences that aren't very familiar with it. In these kinds of discussions I find it invariably useful to mention that the copyright sign is not part of ASCII. (I suspect that it's the most common character that makes a text lose its lowest common denominator status.) A./ On 11/10/2010 11:41 AM, Jim Monty wrote: Here's a peculiar question. Is there a standard term to describe text that is in some subset CCS of another CCS but, strictly speaking, is only really in the subset CCS because it doesn't have any characters in it other than those represented in the smaller CCS? [...] Jim Monty
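Asmus's observation can be checked mechanically. The Python sketch below (illustrative only) shows that bytes confined to the ASCII range decode to the same characters under ASCII, ISO 8859-1, windows-1252 and UTF-8, and that a single copyright sign is enough to break that lowest-common-denominator property.

```python
# Pure-ASCII bytes decode identically under a whole family of encodings:
data = b"Lowest common denominator text."
decoded = {data.decode(enc) for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")}
assert len(decoded) == 1  # all four encodings agree on this byte string

# One copyright sign and the text is no longer "just ASCII":
data2 = "\u00A9 2010".encode("iso-8859-1")
try:
    data2.decode("ascii")
except UnicodeDecodeError:
    print("not lowest-common-denominator text any more")
```

(Note that data2 also fails to decode as UTF-8, since a lone 0xA9 byte is not a valid UTF-8 sequence; the agreement between encodings really does hold only for the ASCII range.)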
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Mark *— Il meglio è l’inimico del bene —* On Wed, Nov 10, 2010 at 12:38, Asmus Freytag asm...@ix.netcom.com wrote: If you want to get that point across to a general audience, you could use a more colloquial term, albeit one that itself derives from mathematics. Text that can be completely expressed in ASCII fits into something (ASCII) that works as a lowest common denominator of a large number of character sets. You could call it lowest common denominator text. Since ASCII is the only set that exhibits such a lowest common denominator relationship with enough other sets to make it interesting, and since that relation is so well known, it's usually enough to just refer to it by name (ASCII) without needing a general term - except perhaps for general audiences that aren't very familiar with it.

That is actually not the case. There are superset relations among some of the CJK character sets, and also -- practically speaking -- between some of the Windows and ISO 8859 sets. I say practically speaking because, in general environments, the C1 controls are really unused, so where a non-ISO-8859 set is the same except for 80..9F you can treat it pragmatically as a superset. What are also tricky are the 'almost' supersets, where there are only a few different characters. Those definitely cause problems because the difference in the data is almost undetectable. [...]
RE: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Or did you mean this is UTF-8 even though it only has characters that also look like ASCII? I was a bit confused :) If you are communicating this information, then that's probably also a good time to communicate Use Unicode, like UTF-8, and you won't have this kind of problem! -Shawn

-Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Asmus Freytag Sent: Wednesday, November 10, 2010 12:39 PM To: Jim Monty Cc: unicode@unicode.org Subject: Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

If you want to get that point across to a general audience, you could use a more colloquial term, albeit one that itself derives from mathematics. [...]
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Specifically for ASCII, a common term is seven-bit ASCII. markus
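The seven-bit ASCII property is trivial to test for in code. A small Python sketch (the helper name is my own; since Python 3.7 the built-in str.isascii() does the same job):

```python
def is_seven_bit_ascii(s: str) -> bool:
    """True if every character is in the 7-bit ASCII range (U+0000..U+007F)."""
    return all(ord(c) < 128 for c in s)

print(is_seven_bit_ascii("plain text"))   # True
print(is_seven_bit_ascii("\u00A9 2010"))  # False (copyright sign)
```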
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Mark Davis wrote: What are also tricky are the 'almost' supersets, where there are only a few different characters. Those definitely cause problems because the difference in data is almost undetectable. For example, Mark is referring to cases such as ISO 8859-1 and 8859-15. Those share all the same encoded characters except those at the code points 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, and 0xBC..0xBE. So neither of the repertoires is a proper subset of the other, but the two coded character sets share the vast majority of their characters, including almost all of the common ones. --Ken
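Ken's list of differing code points can be verified programmatically. A quick Python sketch comparing the two single-byte decodings over the graphic range:

```python
# Find every byte that ISO 8859-1 and ISO 8859-15 interpret differently.
diff = [b for b in range(0xA0, 0x100)
        if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")]
print([hex(b) for b in diff])
# ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']
```

The eight positions are exactly those Ken names; 8859-15 replaces, among others, the currency sign at 0xA4 with the euro sign.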
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
Even more interesting are Windows-1252 and ISO 8859-15, where the former is a repertoire superset of the latter for the graphic characters, but not an encoding superset. On Wed, Nov 10, 2010 at 5:53 PM, Kenneth Whistler k...@sybase.com wrote: Mark Davis wrote: What are also tricky are the 'almost' supersets, where there are only a few different characters. Those definitely cause problems because the difference in data is almost undetectable. For example, Mark is referring to cases such as ISO 8859-1 and 8859-15. [...]
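The repertoire-superset-but-not-encoding-superset distinction can be demonstrated directly. A Python sketch (the helper function is my own framing of the comparison, restricted to graphic characters as in the claim above):

```python
def graphic_repertoire(enc):
    """All characters a single-byte code page can represent, skipping
    undefined bytes, DEL, and the C1 control range."""
    chars = set()
    for b in range(0x20, 0x100):
        try:
            c = bytes([b]).decode(enc)
        except UnicodeDecodeError:
            continue  # byte undefined in this code page (e.g. 0x81 in cp1252)
        if not ("\x7f" <= c <= "\x9f"):  # drop DEL and C1 controls
            chars.add(c)
    return chars

# windows-1252 covers every graphic character of ISO 8859-15 ...
assert graphic_repertoire("iso-8859-15") <= graphic_repertoire("cp1252")
# ... yet it is not an encoding superset: the same byte means different things
# (0xA4 is the euro sign in 8859-15 but the currency sign in cp1252).
assert bytes([0xA4]).decode("iso-8859-15") != bytes([0xA4]).decode("cp1252")
print("repertoire superset, but not an encoding superset")
```

Windows-1252 reaches the extra repertoire (euro sign, Š, Œ, Ÿ, etc.) through the 0x80..0x9F range instead, which is why the byte-level mappings disagree.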
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
I like lowest common denominator as a helpful term. It's familiar and means just the right thing, euphemistically. Thank you, Asmus. You grokked what I struggled to express. Jim Monty - Original Message From: Asmus Freytag asm...@ix.netcom.com To: Jim Monty jim.mo...@yahoo.com Cc: unicode@unicode.org Sent: Wed, November 10, 2010 1:38:55 PM Subject: Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding? If you want to get that point across to a general audience, you could use a more colloquial term, albeit one that itself derives from mathematics. [...]
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
On 2010/11/11 6:28, Mark Davis ☕ wrote: That is actually not the case. There are superset relations among some of the CJK character sets, and also -- practically speaking -- between some of the windows and ISO-8859 sets. I say practically speaking because in general environments, the C1 controls are really unused, so where a non ISO-8859 set is same except for 80..9F you can treat it pragmatically as a superset.

Yes, except that the terms superset/subset (and set in general) shouldn't be used unless you really, strictly, speak about the repertoire of characters, and not the encoding itself. So e.g. the repertoire of iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1 is not a subset of UTF-8 -- not because you can't label some text encoded as iso-8859-1, but because subset relationships among the encodings themselves don't make sense. Also, US-ASCII is not a subset of UTF-8, because when you just use the names of the character encodings, you mean the character encodings, and character encodings don't have subset relationships. It may well be possible to use (create?) the term sub-encoding, saying that an encoding A is a sub-encoding of encoding B if all (legal) byte sequences in encoding A are also legal byte sequences in encoding B and are interpreted as the same characters in both cases. In this sense, US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding of many other encodings. You can also say that iso-8859-1 is a sub-encoding of windows-1252 if the former is interpreted as not including the C1 range. Regards, Martin. -- #-# Martin J. Dürst, Professor, Aoyama Gakuin University #-# http://www.sw.it.aoyama.ac.jp mailto:due...@it.aoyama.ac.jp
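Martin's proposed "sub-encoding" relation can be tested mechanically for single-byte sequences. A Python sketch (the function name is my own, not an established term; for single-byte code pages on the "sub" side this byte-at-a-time check is sufficient, since every legal text is a concatenation of legal single bytes):

```python
def is_sub_encoding_single_byte(sub, sup):
    """Check the 'sub-encoding' relation over single bytes: every byte
    legal in `sub` must be legal in `sup` and decode to the same character."""
    for b in range(256):
        raw = bytes([b])
        try:
            c_sub = raw.decode(sub)
        except UnicodeDecodeError:
            continue  # byte not legal in sub, so it imposes no constraint
        try:
            c_sup = raw.decode(sup)
        except UnicodeDecodeError:
            return False  # legal in sub but not in sup
        if c_sub != c_sup:
            return False  # legal in both but interpreted differently
    return True

# US-ASCII is a sub-encoding of UTF-8 (and of latin-1, and many others):
assert is_sub_encoding_single_byte("ascii", "utf-8")
assert is_sub_encoding_single_byte("ascii", "latin-1")
# iso-8859-1 is NOT a sub-encoding of UTF-8 (0x80..0xFF alone are illegal UTF-8):
assert not is_sub_encoding_single_byte("latin-1", "utf-8")
# Martin's caveat: with the C1 range included, iso-8859-1 is not
# a sub-encoding of windows-1252 either (0x80 decodes differently).
assert not is_sub_encoding_single_byte("latin-1", "cp1252")
print("all sub-encoding checks pass")
```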
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
* Jim Monty wrote: Is there a standard term to describe text that is in some subset CCS of another CCS but, strictly speaking, is only really in the subset CCS because it doesn't have any characters in it other than those represented in the smaller CCS? [...] I often need to talk and write about text that has such multiple personalities, but I invariably struggle to make my point clearly and succinctly. I wind up describing the notion of it in awkwardly verbose detail.

You are asking for a term to say something unambiguously (just this), but then tell us that you wish to talk about ambiguity (multiple). If you want to talk about just this then there is no specific instance of text, so the problem this is X but it could also be Y or Z does not arise. If you want to talk about multiple then you lack a frame of reference and all the multiple are equivalent. Fundamentally, I do not think it makes sense to say that some text is in some encoding. Text is text; you wouldn't pick up a dead-tree kind of book and say Oh, this is UTF-8 and US-ASCII and ISO-8859-1 encoded because it uses only letters found in the ASCII repertoire. If you have a container that contains only bit strings that are UTF-8 encoded sequences of Unicode scalar values, then do not talk about any specific thing that could go in that container. If you have a specific sequence of Unicode scalar values and a string of bits, and want to point out that for that specific bit string many encodings map the string to the same sequence of Unicode scalar values, then I do not see why you would need a specific term.
Perhaps http://en.wikipedia.org/wiki/Polyglot_(computing) is relevant here. -- Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Combining Triple Diacritics (N3915) not accepted by UTC #125
Or the other way around... On Thu, Nov 11, 2010 at 08:53:49AM +0200, Klaas Ruppel wrote: Typographic solutions (however established they may be) do not solve encoding matters. Best regards, __ Klaas Ruppel www.kotus.fi/?l=ens=1 Kotus www.kotus.fi Focis www.focis.fi Tel. +358 207 813 278 Fax +358 207 813 219

Khaled Hosny wrote on 10.11.2010 at 20.03: On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote: [...] This will not happen. Linguists will continue to use their PUA code points (or even their 8-bit fonts), which serve these characters perfectly (albeit using precomposed glyphs for the used combinations). Advanced typesetting engines like TeX (which was invented 30 years ago, mind you) already support wide accents that span multiple characters: $\widehat{abcd}$ $\widetilde{abcd}$ \bye Even math formulas in recent MS Office versions can do that. [...] Regards, Khaled -- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer