RE: Pupil's question about Burmese

2010-11-10 Thread Shawn Steele
FWIW: The OS really likes Unicode, so lots of the text input, etc., is really 
Unicode.  ANSI apps (including non-Unicode web pages) get the data back from 
those controls in ANSI, so you can lose data that looked like it was entered correctly.  

As mentioned, the solution is to fix the app to use Unicode.  Especially for 
a language like this.  In these cases, machines will be fairly inconsistent 
even if they did support some code page, but Unicode works most everywhere.

Usually it's not difficult for a web page to switch to UTF-8.  If it's a form, 
it's even possible that overriding it on your end might get the data posted 
back in UTF-8 and succeed (if you're really lucky), but the real fix is to have 
the web server serve Unicode.
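The data loss described above can be sketched in a few lines of Python (an illustrative aside, not part of the original mail): round-tripping Burmese text through a legacy single-byte "ANSI" code page such as cp1252 destroys it, while UTF-8 preserves it.

```python
# Burmese sample text (Myanmar script, outside every legacy ANSI code page).
text = "မြန်မာ"

# A single-byte code page like cp1252 cannot represent these characters;
# with the "replace" error handler each one silently becomes '?'.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)  # ??????

# UTF-8 round-trips the same text without any loss.
assert text.encode("utf-8").decode("utf-8") == text
```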

-Shawn

 
http://blogs.msdn.com/shawnste



From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of 
Peter Constable [peter...@microsoft.com]
Sent: Tuesday, November 09, 2010 10:42 PM
To: James Lin; Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

A non-Unicode web page is like a non-Unicode app. Web pages, and apps, should 
use Unicode.


Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Lin
Sent: Tuesday, November 09, 2010 11:24 AM
To: Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

Oh, don't get me wrong. Having Unicode is like wearing a crown and being a 
king.  It's the best thing out there.

What I am referring to is this: if a web page or an application does not 
support Unicode, then even on Windows 7 with an English locale (even though 
natively it supports UTF-16), it is not possible to copy/paste directly 
without the correct supported locale; otherwise you may damage the bytes of 
the characters, which shows up as corruption.

Even though most modern APIs are (hopefully) written with Unicode calls, not 
all (legacy) applications are written in Unicode, so conversion is still 
necessary even to handle non-ASCII data.

Let me know if I am still missing something here.

-Original Message-
From: Ed [mailto:ed.tra...@gmail.com]
Sent: Tuesday, November 09, 2010 11:02 AM
To: James Lin
Cc: Unicode Mailing List
Subject: Re: Pupil's question about Burmese


 Yes, displaying is fine, but the original question is copying and
 pasting; without the correct locale settings, you can't copy/paste
 without corrupting the bytes.  Copy/paste is generally handled by the
 OS itself, not the application.  Even with a Unicode-supporting
 application, you can display, but you can't handle non-ASCII characters.

Why not?  Modern Win32 OSes use UTF-16.  Presumably most modern applications 
are written using calls to the modern API which should seamlessly support 
copy-and-paste of Unicode text, regardless of script or language -- so long as 
the script or language is supported at the level of displaying the text 
correctly and you have a font that works for that script.  Actually, even if 
the text displays imperfectly (i.e., one sees square boxes when lacking a 
proper font, or even if OpenType GPOS and GSUB rules are not correct for a 
Complex Text Layout script like Burmese), copy-and-paste of the raw Unicode 
text should still work correctly.

Is this not the case?




Re: Pupil's question about Burmese

2010-11-10 Thread Keith Stribley

On 11/10/2010 02:17 PM, Shawn Steele wrote:

As mentioned, the solution is to fix the app to use Unicode.  Especially for 
a language like this.  In these cases, machines will be fairly inconsistent even if they 
did support some code page, but Unicode works most everywhere.



AFAIK there never has been a standard code page for Myanmar text; 
Unicode was the first time storage of Burmese text was standardised for 
computers. There are several different legacy font families in use for 
Myanmar, each with its own slightly different mapping to Latin code 
points. The font in question has a Unicode cmap table, but the map is 
from Latin code points to glyphs, not from Myanmar code points to 
glyphs. There are also several fonts which map incorrectly from the 
Myanmar Unicode block, using the Mon, Shan and Karen code points for 
glyph variants so that the font can avoid needing OpenType/Graphite/AAT rules.


If anyone is having trouble installing genuine Myanmar Unicode fonts, 
then I have some instructions at


http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/gettingStarted.php

Keith





Are Latin and Cyrillic essentially the same script?

2010-11-10 Thread Karl Pentzlin
As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf
(= L2/10-356), there exists a Latin letter which resembles the Cyrillic
soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif
variant of the alphabet, which was used for several languages in the
former Soviet Union (e.g. Tatar) and was developed in parallel to the
alphabets nowadays in use for Turkish and Azerbaijani, see:
http://en.wikipedia.org/wiki/Janalif .
In fact, it was proposed on this basis, being the only Jaꞑalif letter
still missing, since the ꞑ (occurring in the alphabet name itself)
was introduced with Unicode 6.0.

The letter is not a soft sign; it is the exact Tatar equivalent of the
Turkish dotless i, and thus has a use similar to that of the Cyrillic yeru
Ы/ы (U+042B/U+044B).

In this function, it is part of the adaptation of the Latin alphabet
for a lot of non-Russian languages in the Soviet Union in the 1920s,
see e.g.: Юшманов, Н. В.: Определитель языков [Identifier of Languages],
Moscow/Leningrad 1941,
http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 .
(A proposal regarding this subject is expected for 2011.)

Thus, it shares with the Cyrillic soft sign its form and partly the
geographical area of its use, but in no case its meaning. The same can
be said, e.g., for P/p (U+0050/U+0070, Latin letter P) and Р/р
(U+0420/U+0440, Cyrillic letter ER).

According to the pre-preliminary minutes of UTC #125 (L2/10-415),
the UTC has not accepted the Latin Ь/ь.

It is an established practice for the European alphabetic scripts to
encode a new letter only if it has a different shape (in at least one
of the capital and small forms) relative to all already encoded
letters of the same script. The Y/y is well known to denote completely
different pronunciations, used as a consonant as well as a vocal, even within
the same language. Thus, if somebody unearths a Latin letter E/e in some
obscure minority language which has no E-like vocal, used to denote an M-like
sound and in fact collated after the M in the local alphabet, this
will probably not lead to a new encoding.

But, Latin and Cyrillic are different scripts (the question in the Re
of this mail is rhetorical, of course).

Admittedly, there is also a precedent for using Cyrillic letters in
Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone
letters in Zhuang. However, the orthography using them was
short-lived, being superseded by another Latin orthography which uses
genuine Latin letters as tone marks (J/j and X/x, in this case).

On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь
did not lose the Ь/ь through an improvement of the orthography, but were
completely deprecated by a ukase of Stalin. Thus, they continue to be
the Latin alphabets of the respective languages.
Whether a revival is formally requested or not, they are regarded as valid
by the members of the cultural group (even if only to access their cultural
inheritance).
In particular, it cannot be excluded that people will want to create Latin
domain names or e-mail addresses without being accused of script mixing.

Taking this into account, not to mention the technical problems
regarding collation etc. and the typographical issues when it comes to
subtle differences between Latin and Cyrillic in high-quality
typography, it is really hard to understand why the UTC refuses to encode
the Latin Ь/ь.

A quick glance at the Юшманов table mentioned above proves that there
is absolutely no request to duplicate the whole Cyrillic alphabet in
Latin, as someone may have feared.

- Karl Pentzlin




Re: Are Latin and Cyrillic essentially the same script?

2010-11-10 Thread Karl Pentzlin
2010-11-10 10:08, I wrote:

KP As shown in N3916 ...

Please read "vowel" instead of "vocal" throughout the mail. Sorry.




Combining Triple Diacritics (N3915) not accepted by UTC #125

2010-11-10 Thread Karl Pentzlin
From the Pre-Preliminary minutes of UTC #125 (L2/10-416):

 C.4 Preliminary Proposal to enable the use of Combining Triple
 Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353]
  - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf

 [125-A13] ... UTC does not believe that either solution A or solution B
 represents an appropriate encoding solution for the text
 representation problem shown in this document. Appropriate
 technology involving markup should be applied to the problem of
 representation of text at this level.

This will not happen.
Linguists will continue to use their PUA code points (or even their
8-bit fonts), which employ these characters perfectly (albeit using
precomposed glyphs for the used combinations).

  This is not plain text.

It *is*, at least for the applications in dialectology where groups of
three characters linked by one of the proposed triple diacritics have a
well-defined and documented meaning.

This is also proven by the fact that the existing PUA characters
fulfil the needs of the relevant academic communities perfectly,
except that they are not interchangeable without special fonts containing
these PUA characters (a shortcoming which would be overcome once these
characters are included in Unicode).

 Processes such as line-breaking do not know about these, or the
 double diacritics, and this creates problems for processes.

Problems are there to be solved, and they are solvable.
E.g., simply state that no line break may occur within the span of a
diacritic covering three letters.

Latin *is* a complex script, anyway.

- Karl Pentzlin





Re: Combining Triple Diacritics (N3915) not accepted by UTC #125

2010-11-10 Thread Khaled Hosny
On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote:
 From the Pre-Preliminary minutes of UTC #125 (L2/10-416):
 
  C.4 Preliminary Proposal to enable the use of Combining Triple
  Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353]
   - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf
 
  [125-A13] ... UTC does not believe that either solution A or solution B
  represents an appropriate encoding solution for the text
  representation problem shown in this document. Appropriate
  technology involving markup should be applied to the problem of
  representation of text at this level.
 
 This will not happen.
 Linguists will continue to use their PUA code points (or even their
 8-bit fonts), which employ these characters perfectly (albeit using
 precomposed glyphs for the used combinations).

Advanced typesetting engines like TeX (which were invented 30 years ago,
mind you) already support wide accents that span multiple characters:

$\widehat{abcd}$
$\widetilde{abcd}$
\bye

Even math formulas in new MS Office versions can do that (well, it is
math-only because, apparently, only mathematicians cared about this, but I
don't see why it should not work for linguists too).

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



RE: Combining Triple Diacritics (N3915) not accepted by UTC #125

2010-11-10 Thread Murray Sargent
You can put diacritics over an arbitrarily large base by using an accent object 
in a math zone. For example, in my email editor (Outlook), I type alt+= to 
insert a math zone and then (a+b)\tilde followed by two spaces to get






(a wide tilde over a+b). Evidently linguistic analysis is yet another field in 
which mathematical typography is useful.



Murray



Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Jim Monty
Here's a peculiar question.

Is there a standard term to describe text that is in some subset CCS of another 
CCS but, strictly speaking, is only really in the subset CCS because it doesn't 
have any characters in it other than those represented in the smaller CCS?

(The fact that I struggled to phrase this question in a way that made my 
meaning 
clear -- and failed -- is precisely my dilemma.)

Text that has in it only characters that are in the 
ASCII character encoding is also in the ISO 8859-1 character encoding and the 
UTF-8 character encoding form of the Unicode coded character set, right? I 
often 
need to talk and write about text that has such multiple personalities, but I 
invariably struggle to make my point clearly and succinctly. I wind up 
describing the notion of it in awkwardly verbose detail.

So I'm left wondering if the character encoding cognoscenti have a special 
utilitarian word for this, maybe one borrowed from mathematics (set theory).

Jim Monty





Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Asmus Freytag
If you want to get that point across to a general audience, you could 
use a more colloquial term, albeit one that itself derives from mathematics.


Text that can be completely expressed in ASCII fits into something 
(ASCII) that works as a lowest common denominator of a large number of 
character sets.


You could call it lowest common denominator text.

Since ASCII is the only set that exhibits such a lowest common 
denominator relationship with enough other sets to make it interesting, 
and since that relation is so well known, it's usually enough to just 
refer to it by name (ASCII) without needing a general term - except 
perhaps for general audiences that aren't very familiar with it.


In these kinds of discussions I find it invariably useful to mention that 
the copyright sign is not part of ASCII. (I suspect that it's the most 
common character that makes a text lose its lowest-common-denominator 
status.)
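The copyright-sign test is easy to run mechanically; a small Python illustration (not part of the original mail):

```python
plain = "Copyright (C) 2010"
fancy = "Copyright © 2010"

# str.isascii() (Python 3.7+) tells us whether text stays within the
# lowest-common-denominator repertoire.
print(plain.isascii())  # True
print(fancy.isascii())  # False

# Encoding to ASCII fails exactly at the offending character.
try:
    fancy.encode("ascii")
except UnicodeEncodeError as err:
    print(err.object[err.start])  # ©
```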


A./





On 11/10/2010 11:41 AM, Jim Monty wrote:

Here's a peculiar question.

Is there a standard term to describe text that is in some subset CCS of another
CCS but, strictly speaking, is only really in the subset CCS because it doesn't
have any characters in it other than those represented in the smaller CCS?

(The fact that I struggled to phrase this question in a way that made my meaning
clear -- and failed -- is precisely my dilemma.)

Text that has in it only characters that are in the
ASCII character encoding is also in the ISO 8859-1 character encoding and the
UTF-8 character encoding form of the Unicode coded character set, right? I often
need to talk and write about text that has such multiple personalities, but I
invariably struggle to make my point clearly and succinctly. I wind up
describing the notion of it in awkwardly verbose detail.

So I'm left wondering if the character encoding cognoscenti have a special
utilitarian word for this, maybe one borrowed from mathematics (set theory).

Jim Monty









Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Mark Davis ☕
Mark

*— Il meglio è l’inimico del bene —*


On Wed, Nov 10, 2010 at 12:38, Asmus Freytag asm...@ix.netcom.com wrote:

 If you want to get that point across to a general audience, you could use a
 more colloquial term, albeit one that itself derives from mathematics.

 Text that can be completely expressed in ASCII fits into something
 (ASCII) that works as a lowest common denominator of a large number of
 character sets.

 You could call it lowest common denominator text.

 Since ASCII is the only set that exhibits such a lowest common denominator
 relationship with enough other sets to make it interesting, and since that
 relation is so well known, it's usually enough to just refer to it by name
 (ASCII) without needing a general term - except perhaps for general
 audiences that aren't very familiar with it.


That is actually not the case. There are superset relations among some of
the CJK character sets, and also -- practically speaking -- between some of
the Windows and ISO 8859 sets. I say practically speaking because in general
environments the C1 controls are really unused, so where a non-ISO 8859 set
is the same except for 0x80..0x9F you can treat it pragmatically as a superset.

What are also tricky are the 'almost' supersets, where there are only a few
different characters. Those definitely cause problems because the difference
in data is almost undetectable.
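The "practically a superset" relation above can be checked byte by byte; an illustrative Python sketch (not from the thread) comparing windows-1252 with ISO 8859-1:

```python
# Find every byte that ISO 8859-1 and windows-1252 interpret differently.
# Bytes undefined in cp1252 decode to U+FFFD via the "replace" handler.
diffs = [
    b for b in range(256)
    if bytes([b]).decode("latin-1") != bytes([b]).decode("cp1252", errors="replace")
]

# All differences fall in the C1 control range 0x80..0x9F, so in an
# environment that never uses C1 controls, cp1252 behaves as a superset.
print(all(0x80 <= b <= 0x9F for b in diffs))  # True
```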



 In these kinds of discussions I find it invariably useful to mention that
 the copyright sign is not part of ASCII. (I suspect that it's the most
 common character that makes a text lose its lowest-common-denominator
 status.)

 A./






 On 11/10/2010 11:41 AM, Jim Monty wrote:

 Here's a peculiar question.

 Is there a standard term to describe text that is in some subset CCS of
 another
 CCS but, strictly speaking, is only really in the subset CCS because it
 doesn't
 have any characters in it other than those represented in the smaller CCS?

 (The fact that I struggled to phrase this question in a way that made my
 meaning
 clear -- and failed -- is precisely my dilemma.)

 Text that has in it only characters that are in the
 ASCII character encoding is also in the ISO 8859-1 character encoding and
 the
 UTF-8 character encoding form of the Unicode coded character set, right? I
 often
 need to talk and write about text that has such multiple personalities,
 but I
 invariably struggle to make my point clearly and succinctly. I wind up
 describing the notion of it in awkwardly verbose detail.

 So I'm left wondering if the character encoding cognoscenti have a special
 utilitarian word for this, maybe one borrowed from mathematics (set
 theory).

 Jim Monty









RE: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Shawn Steele
Or did you mean "this is UTF-8 even though it only has characters that also 
look like ASCII"?  I was a bit confused :)

If you are communicating this information, then that's probably also a good 
time to communicate "Use Unicode, like UTF-8, and you won't have this kind 
of problem!"

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Asmus Freytag
Sent: Wednesday, November 10, 2010 12:39 PM
To: Jim Monty
Cc: unicode@unicode.org
Subject: Re: Is there a term for 
strictly-just-this-encoding-and-not-really-that-encoding?

If you want to get that point across to a general audience, you could use a 
more colloquial term, albeit one that itself derives from mathematics.

Text that can be completely expressed in ASCII fits into something
(ASCII) that works as a lowest common denominator of a large number of 
character sets.

You could call it lowest common denominator text.

Since ASCII is the only set that exhibits such a lowest common denominator 
relationship with enough other sets to make it interesting, and since that 
relation is so well known, it's usually enough to just refer to it by name 
(ASCII) without needing a general term - except perhaps for general audiences 
that aren't very familiar with it.

In these kinds of discussions I find it invariably useful to mention that the 
copyright sign is not part of ASCII. (I suspect that it's the most common 
character that makes a text lose its lowest-common-denominator 
status.)

A./





On 11/10/2010 11:41 AM, Jim Monty wrote:
 Here's a peculiar question.

 Is there a standard term to describe text that is in some subset CCS 
 of another CCS but, strictly speaking, is only really in the subset 
 CCS because it doesn't have any characters in it other than those represented 
 in the smaller CCS?

 (The fact that I struggled to phrase this question in a way that made 
 my meaning clear -- and failed -- is precisely my dilemma.)

 Text that has in it only characters that are in the ASCII character 
 encoding is also in the ISO 8859-1 character encoding and the
 UTF-8 character encoding form of the Unicode coded character set, 
 right? I often need to talk and write about text that has such 
 multiple personalities, but I invariably struggle to make my point 
 clearly and succinctly. I wind up describing the notion of it in awkwardly 
 verbose detail.

 So I'm left wondering if the character encoding cognoscenti have a 
 special utilitarian word for this, maybe one borrowed from mathematics (set 
 theory).

 Jim Monty











Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Markus Scherer
Specifically for ASCII, a common term is seven-bit ASCII.
markus


Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Kenneth Whistler
Mark Davis wrote:

 What are also tricky are the 'almost' supersets, where there are only a few
 different characters. Those definitely cause problems because the difference
 in data is almost undetectable.

For example, Mark is referring to cases such as ISO 8859-1 and 8859-15.

Those share all the same encoded characters except those at
the code points 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, and 0xBC..0xBE.

So neither of the repertoires is a proper subset of the other,
but the two coded character sets share the vast majority
of their characters, including almost all of the common ones.

--Ken
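The differing positions Ken lists can be verified directly (a quick Python check, not part of the original mail):

```python
# Compare how ISO 8859-1 and ISO 8859-15 interpret each high byte.
diffs = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("latin-1") != bytes([b]).decode("iso8859_15")
]
print([hex(b) for b in diffs])
# ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']
```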




Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Tim Greenwood
Even more interesting are Windows-1252 and ISO 8859-15, where the former is a
repertoire superset of the latter for the graphic characters, but not an
encoding superset.
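The distinction is easy to see concretely (an illustrative check, not from the original mail): the euro sign is in both repertoires, but at different byte values, so identical text encodes differently.

```python
# Same character, different bytes: a repertoire superset need not be an
# encoding superset.
print("€".encode("cp1252"))      # b'\x80'
print("€".encode("iso8859_15"))  # b'\xa4'
```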

On Wed, Nov 10, 2010 at 5:53 PM, Kenneth Whistler k...@sybase.com wrote:

 Mark Davis wrote:

  What are also tricky are the 'almost' supersets, where there are only a
 few
  different characters. Those definitely cause problems because the
 difference
  in data is almost undetectable.

 For example, Mark is referring to cases such as ISO 8859-1 and 8859-15.

 Those share all the same encoded characters except those at
 the code points 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, and 0xBC..0xBE.

 So neither of the repertoires is a proper subset of the other,
 but the two coded character sets share the vast majority
 of their characters, including almost all of the common ones.

 --Ken





Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Jim Monty
I like "lowest common denominator" as a helpful term. It's familiar and means 
just the right thing, euphemistically.
 
Thank you, Asmus. You grokked what I struggled to express.
 
Jim Monty



- Original Message 
From: Asmus Freytag asm...@ix.netcom.com
To: Jim Monty jim.mo...@yahoo.com
Cc: unicode@unicode.org
Sent: Wed, November 10, 2010 1:38:55 PM
Subject: Re: Is there a term for 
strictly-just-this-encoding-and-not-really-that-encoding?

If you want to get that point across to a general audience, you could use a 
more 
colloquial term, albeit one that itself derives from mathematics.

Text that can be completely expressed in ASCII fits into something (ASCII) 
that works as a lowest common denominator of a large number of character sets.

You could call it lowest common denominator text.

Since ASCII is the only set that exhibits such a lowest common denominator 
relationship with enough other sets to make it interesting, and since that 
relation is so well known, it's usually enough to just refer to it by name 
(ASCII) without needing a general term - except perhaps for general audiences 
that aren't very familiar with it.

In these kinds of discussions I find it invariably useful to mention that the 
copyright sign is not part of ASCII. (I suspect that it's the most common 
character that makes a text lose its lowest-common-denominator status.)

A./


On 11/10/2010 11:41 AM, Jim Monty wrote:
 Here's a peculiar question.
 
 Is there a standard term to describe text that is in some subset CCS of 
another
 CCS but, strictly speaking, is only really in the subset CCS because it 
doesn't
 have any characters in it other than those represented in the smaller CCS?
 
 (The fact that I struggled to phrase this question in a way that made my 
meaning
 clear -- and failed -- is precisely my dilemma.)
 
 Text that has in it only characters that are in the
 ASCII character encoding is also in the ISO 8859-1 character encoding and the
 UTF-8 character encoding form of the Unicode coded character set, right? I 
often
 need to talk and write about text that has such multiple personalities, but I
 invariably struggle to make my point clearly and succinctly. I wind up
 describing the notion of it in awkwardly verbose detail.
 
 So I'm left wondering if the character encoding cognoscenti have a special
 utilitarian word for this, maybe one borrowed from mathematics (set theory).
 
 Jim Monty




Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Martin J. Dürst

On 2010/11/11 6:28, Mark Davis ☕ wrote:


That is actually not the case. There are superset relations among some of
the CJK character sets, and also -- practically speaking -- between some of
the windows and ISO-8859 sets. I say practically speaking because in general
environments, the C1 controls are really unused, so where a non ISO-8859 set
is same except for 80..9F you can treat it pragmatically as a superset.


Yes, except that the terms superset/subset (and set in general) 
shouldn't be used unless you really strictly speak about the repertoire 
of characters, and not the encoding itself. So e.g. the repertoire of 
iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1 
is not a subset of UTF-8 -- not because you can't label some text encoded 
as iso-8859-1, but because subset relationships among the encodings 
themselves don't make sense.
Also, US-ASCII is not a subset of UTF-8, because when you just use the 
names of the character encodings, you mean the character encodings, and 
character encodings don't have subset relationships.


It may well be possible to use (create?) the term "sub-encoding", 
saying that an encoding A is a sub-encoding of encoding B if all (legal) 
byte sequences in encoding A are also legal byte sequences in encoding B 
and are interpreted as the same characters in both cases. In this sense, 
US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding 
of many other encodings. You can also say that iso-8859-1 is a 
sub-encoding of windows-1252 if the former is interpreted as not 
including the C1 range.
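The proposed relation can be tested exhaustively for single-byte sequences; a Python sketch (the function name is my own, not from the mail):

```python
def is_single_byte_sub_encoding(small: str, big: str) -> bool:
    """True if every byte legal in `small` decodes to the same character
    in `big`.  (Checks single-byte sequences only, which suffices for
    single-byte encodings such as US-ASCII.)"""
    for b in range(256):
        raw = bytes([b])
        try:
            c = raw.decode(small)
        except UnicodeDecodeError:
            continue  # not a legal byte in the smaller encoding
        if raw.decode(big, errors="replace") != c:
            return False
    return True

print(is_single_byte_sub_encoding("ascii", "utf-8"))    # True
print(is_single_byte_sub_encoding("latin-1", "utf-8"))  # False
```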


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Bjoern Hoehrmann
* Jim Monty wrote:
Is there a standard term to describe text that is in some subset CCS of 
another 
CCS but, strictly speaking, is only really in the subset CCS because it 
doesn't 
have any characters in it other than those represented in the smaller CCS?

(The fact that I struggled to phrase this question in a way that made my 
meaning 
clear -- and failed -- is precisely my dilemma.)

Text that has in it only characters that are in the 
ASCII character encoding is also in the ISO 8859-1 character encoding and the 
UTF-8 character encoding form of the Unicode coded character set, right? I 
often 
need to talk and write about text that has such multiple personalities, but I 
invariably struggle to make my point clearly and succinctly. I wind up 
describing the notion of it in awkwardly verbose detail.

You are asking for a term to say something unambiguously ("just this"),
but then tell us that you wish to talk about ambiguity ("multiple"). If
you want to talk about "just this" then there is no specific instance of
text, so the problem "this is X but it could also be Y or Z" does not
arise. If you want to talk about "multiple" then you lack a frame of
reference and all the multiples are equivalent.

Fundamentally, I do not think it makes sense to say that some text is in
some encoding. Text is text; you wouldn't pick up a dead-tree kind of
book and say "Oh, this is UTF-8 and US-ASCII and ISO-8859-1 encoded"
because it uses only letters found in the ASCII repertoire.

If you have a container that contains only bit strings that are UTF-8
encoded sequences of Unicode scalar values, then you are not talking about
any specific thing that could go in that container.

If you have a specific sequence of Unicode scalar values and a string of
bits, and want to point out that for that specific bit string many
encodings map the string to the same sequence of Unicode scalar values,
then I do not see why you would need a specific term.

Perhaps http://en.wikipedia.org/wiki/Polyglot_(computing) is relevant
here.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 



Re: Combining Triple Diacritics (N3915) not accepted by UTC #125

2010-11-10 Thread Khaled Hosny
Or the other way around...

On Thu, Nov 11, 2010 at 08:53:49AM +0200, Klaas Ruppel wrote:
 Typographic solutions (however established they may be) do not solve encoding
 matters.
 
 Best regards,
 __
 Klaas Ruppel   www.kotus.fi/?l=en&s=1
 Kotus          www.kotus.fi
 Focis          www.focis.fi
 Tel. +358 207 813 278  Fax +358 207 813 219
 
 
 Khaled Hosny wrote on 10.11.2010 at 20.03:
 
 On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote:
 From the Pre-Preliminary minutes of UTC #125 (L2/10-416):
 
 C.4 Preliminary Proposal to enable the use of Combining Triple
 Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353]
  - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf
 
 [125-A13] ... UTC does not believe that either solution A or solution B
 represents an appropriate encoding solution for the text
 representation problem shown in this document. Appropriate
 technology involving markup should be applied to the problem of
 representation of text at this level.
 
 This will not happen.
 Linguists will continue to use their PUA code points (or even their
 8-bit fonts), which employ these characters perfectly (albeit using
 precomposed glyphs for the used combinations).
 
 Advanced typesetting engines like TeX (which were invented 30 years ago,
 mind you) already support wide accents that span multiple characters:
 
 $\widehat{abcd}$
 $\widetilde{abcd}$
 \bye
 
 Even math formulas in new MS Office versions can do that (well, it is
 math-only because, apparently, only mathematicians cared about this, but I
 don't see why it should not work for linguists too).
 
 Regards,
 Khaled
 
 --
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer
 

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer