
Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode

On Wed, 1 Jan 2020 20:11:04 +
James Kass via Unicode  wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
> 
>  > That's exactly the sort of mess that jack-booted renderers are
>  > trying to minimise.  Their principle is that there should be only
>  > one encoding per shape, though to be fair:
>  >
>  > 1) some renderers accept canonical equivalents.
>  > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
>  > collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
>  > 3) Superseded chillu encodings are still supported.  
> 
> There was never any need for atomic chillu form characters.  

> The 
> principle of only one encoding per shape is best achieved when every 
> shape gets an atomic encoding.

I should have written per-word shape.  I should also have added that
most renderers attempt to handle Mongolian, despite its encoding
Middle Mongolian phonetics rather than characters. Also, they don't
attempt to sort the Arabic script per-language subsets out, which
leads to a bad mess at Wiktionary when Unicode characters differ only in
a few forms.

> Glyph-based encoding is incompatible 
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so
complicated that it is unsurprising that its definition is partly
missing from Unicode 1.0.  The official history hides behind
incompatibility with the Thai national standard, but phonetic order was
simply too complicated for Thai.  Additionally, Thais don't agree on
where preposed vowels go relative to Pali consonant clusters - they
don't agree that all of them should appear in the middle of the
cluster.  (I suppose the positioning rule could have been made a
stylistic feature of fonts.)

An analogue is Lao collation.  While syllable boundaries can
overwhelmingly be discerned in modern Lao, Lao collations are too
complicated to be accepted for ICU if they are to support anything but
single syllables.  CLDR collation (interpreted as a specification with
the normal use of specification language for the form of definitions)
can just cope, whereas the UCA can't, but the tables are huge. 

Richard.

Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode

On Wed, 1 Jan 2020 23:09:49 +
James Kass via Unicode  wrote:

> On 2020-01-01 8:11 PM, James Kass wrote:
> > It’s too bad that ISCII didn’t accommodate the needs of Vedic
> > Sanskrit, but here we are.  
> 
> Sorry, that might be wrong to say.  It's possible that it's Unicode's 
> adaptation of ISCII that hinders Vedic Sanskrit.

Have you found a definition of the ISCII handling of Vedic characters?

The problem lies in Unicode's failure to standardise the encoding of
Devanagari text.  But for the consistent failure to include a
standardisation of text in a script in TUS, one might wonder if the
original idea was to duck the issue by resorting to canonical
equivalence.

I've been looking at Microsoft's specification of Devanagari character
order.  In
https://docs.microsoft.com/en-us/typography/script-development/devanagari,
the consonant syllable ends

[N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]

where
N is nukta
A is anudatta (U+0952)
H is halant/virama
M is matra
SM is syllable modifier signs
VD is vedic

"Syllable modifier signs" and "vedic" are not defined.  It appears that
SM includes U+0903 DEVANAGARI SIGN VISARGA.

I note that even ग॒ः  is
given a dotted circle by HarfBuzz.  Now, this might not be an entirely
fair test; I suspect anudatta is assigned this position because
originally the Sindhi implosives were encoded as consonant plus nukta
and anudatta, though rendering still fails with HarfBuzz when nukta is
inserted (ग़॒ः).
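For anyone wanting to check what is actually in those two test strings, here is a minimal sketch using Python's standard `unicodedata` module. It assumes the first string is ga + anudatta + visarga and the second is ga + nukta + anudatta + visarga; the archive may equally have carried the precomposed U+095A, which is canonically equivalent. The helper name `describe` is just for illustration.

```python
import unicodedata

# The two test strings, written as escapes so the code point order is explicit.
plain = "\u0917\u0952\u0903"             # ga, anudatta, visarga
with_nukta = "\u0917\u093C\u0952\u0903"  # ga, nukta, anudatta, visarga

def describe(text):
    """Return (code point, character name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

for label, text in (("plain", plain), ("with nukta", with_nukta)):
    print(label)
    for row in describe(text):
        print("  ", *row)

# U+095A decomposes canonically to <U+0917, U+093C>; being a composition
# exclusion, it stays decomposed under NFC as well.
print(unicodedata.normalize("NFC", "\u095A") == "\u0917\u093C")
```

This only shows what the encoded sequence is; it says nothing about what any particular renderer does with it.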

Richard.

Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode




On 2020-01-01 8:11 PM, James Kass wrote:
> It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit,
> but here we are.


Sorry, that might be wrong to say.  It's possible that it's Unicode's 
adaptation of ISCII that hinders Vedic Sanskrit.

One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode

On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:

> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise.  Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
> (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.

There was never any need for atomic chillu form characters.  The 
principle of only one encoding per shape is best achieved when every 
shape gets an atomic encoding.  Glyph-based encoding is incompatible 
with Unicode character encoding principles.

It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, 
but here we are.

Re: A neat description of encoding characters

2019-12-02 Thread Mark E. Shoulson via Unicode


On 12/2/19 7:01 AM, Costello, Roger L. via Unicode wrote:

> From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum,
> p.74-75


It's a reasonably good explanation of binary numbers and "encoding" in a 
more usual sense than we use it here in Unicode-land.  Actually makes 
for a basis to move on to discussing information theory.  But when 
Unicodites say "encoding", they mean stuff like UTF-8 vs UTF-16, which 
is kind of a different kettle of macaroons.


~mark

Re: A neat description of encoding characters

2019-12-02 Thread James Kass via Unicode





On 2019-12-03 12:59 AM, Richard Wordingham via Unicode wrote:

> On Mon, 2 Dec 2019 12:01:52 +
> "Costello, Roger L. via Unicode"  wrote:
>
>> From the book titled "Computer Power and Human Reason" by Joseph
>> Weizenbaum, p.74-75
>>
>> Suppose that the alphabet with which we wish to concern ourselves
>> consists of 256 distinct symbols...
>
> Why should I wish to concern myself with only one alphabet?

You shouldn't.  But suppose you did.  That's the hypothetical set-up for 
the illustration.


When that book was published in 1976, that illustration may have helped 
some people gain a better understanding of computer encoding.


Nowadays a character string might be required to produce a glyph which 
the user community considers to be a "character" (or letter) in its 
writing system.  Adding variation selectors, invisible 'formatting' 
characters, and non-alphabetic symbols to the mix has moved computer 
encoding way beyond 1976.

Re: A neat description of encoding characters

2019-12-02 Thread Richard Wordingham via Unicode

On Mon, 2 Dec 2019 12:01:52 +
"Costello, Roger L. via Unicode"  wrote:

> From the book titled "Computer Power and Human Reason" by Joseph
> Weizenbaum, p.74-75
> 
> Suppose that the alphabet with which we wish to concern ourselves
> consists of 256 distinct symbols...

Why should I wish to concern myself with only one alphabet?

Richard.

Re: A neat description of encoding characters

2019-12-02 Thread James Tauber via Unicode

Indeed.

Unicode separates: (1) selecting a character repertoire; (2) assigning each
character a numerical character code; (3) choosing an encoding form to
represent those character codes as code units (made up of bytes).

(2) and (3) are not conflated.
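A small Python sketch may make the separation of steps (2) and (3) concrete: the code point of a character is one fixed integer, while the code units depend on which encoding form is chosen. The three sample characters and the helper name `encoding_forms` are arbitrary choices for illustration.

```python
def encoding_forms(ch):
    """Return the code point of ch and its bytes in three encoding forms."""
    return {
        "code point": f"U+{ord(ch):04X}",   # step (2): one integer per character
        "utf-8": ch.encode("utf-8"),        # step (3): 8-bit code units
        "utf-16": ch.encode("utf-16-be"),   # step (3): 16-bit code units
        "utf-32": ch.encode("utf-32-be"),   # step (3): 32-bit code units
    }

# Latin A, Devanagari GA, and MATHEMATICAL ITALIC CAPITAL A (outside the BMP).
for ch in ("A", "\u0917", "\U0001D434"):
    forms = encoding_forms(ch)
    print(forms["code point"],
          forms["utf-8"].hex(), forms["utf-16"].hex(), forms["utf-32"].hex())
```

The same code point comes out as one to four bytes in UTF-8, one or two 16-bit units in UTF-16, and always one 32-bit unit in UTF-32.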

James


On Mon, Dec 2, 2019 at 9:54 AM 梁海 Liang Hai via Unicode 
wrote:

> Grrr… It’s an okayish analog for binary numbers, but not really relevant
> to character encoding. Encoded characters are just assigned with integers,
> which could in turn be represented in any base.
>
> The binary nature of computers’ way of storing numbers does not have much
> to do with how character encoding works—unless you really want to start
> explaining character encoding with those so basic ideas such as “What is
> electricity?”, “What is a computer?”, …
>
> Best,
> 梁海 Liang Hai
> https://lianghai.github.io
>
> > On Dec 2, 2019, at 20:01, Costello, Roger L. via Unicode <
> unicode@unicode.org> wrote:
> >
> > From the book titled "Computer Power and Human Reason" by Joseph
> Weizenbaum, p.74-75
> >
> > Suppose that the alphabet with which we wish to concern ourselves
> consists of 256 distinct symbols. Imagine that we have a deck of 256 cards,
> each of which has a distinct symbol of our alphabet printed on it, and, of
> course, such that there corresponds one card to each symbol. How many
> questions that can be answered "yes" or "no" would one have to ask, given
> one card randomly selected from the deck, in order to be able to decide
> which character is printed on the card? We can certainly make the decision
> by asking at most 256 questions. We can somehow order the symbols and begin
> by asking if it is the first in our ordering, e.g., "It is an uppercase A?"
> If the answer is "no," then we ask if it is the second, and so on. But if
> our ordering is known both to ourselves and to our respondent, there is a
> much more economical way of organizing our questioning. We ask whether the
> character we are seeking is in the first half of the set. Whatever the
> answer, we will have isolated a set of 128 characters among which the
> character we seek resides. We again ask
> whether it is in the first half of that smaller set, and so on. Proceeding
> in this way, we are bound to discover what character is printed on the
> selected card by asking exactly eight questions. We could have recorded the
> answers we received to our questions by writing "1" whenever the answer was
> "yes" and "0" whenever it was "no." That record would then consist of eight
> so-called bits each of which is either "1" or "0". This eight-bit string is
> then an unambiguous representation of the character we are seeking.
> Moreover, each character of the whole set has a unique eight-bit
> representation within the same ordering.
> >
>
>
>

-- 
*James Tauber*

Re: A neat description of encoding characters

2019-12-02 Thread 梁海 Liang Hai via Unicode

Grrr… It’s an okayish analog for binary numbers, but not really relevant to 
character encoding. Encoded characters are just assigned with integers, which 
could in turn be represented in any base.

The binary nature of computers’ way of storing numbers does not have much to do 
with how character encoding works—unless you really want to start explaining 
character encoding with those so basic ideas such as “What is electricity?”, 
“What is a computer?”, …

Best,
梁海 Liang Hai
https://lianghai.github.io

> On Dec 2, 2019, at 20:01, Costello, Roger L. via Unicode 
>  wrote:
> 
> From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum, 
> p.74-75
> 
> Suppose that the alphabet with which we wish to concern ourselves consists of 
> 256 distinct symbols. Imagine that we have a deck of 256 cards, each of which 
> has a distinct symbol of our alphabet printed on it, and, of course, such 
> that there corresponds one card to each symbol. How many questions that can 
> be answered "yes" or "no" would one have to ask, given one card randomly 
> selected from the deck, in order to be able to decide which character is 
> printed on the card? We can certainly make the decision by asking at most 256 
> questions. We can somehow order the symbols and begin by asking if it is the 
> first in our ordering, e.g., "It is an uppercase A?" If the answer is "no," 
> then we ask if it is the second, and so on. But if our ordering is known both 
> to ourselves and to our respondent, there is a much more economical way of 
> organizing our questioning. We ask whether the character we are seeking is in 
> the first half of the set. Whatever the answer, we will have isolated a set
> of 128 characters among which the character we seek resides. We again ask whether 
> it is in the first half of that smaller set, and so on. Proceeding in this 
> way, we are bound to discover what character is printed on the selected card 
> by asking exactly eight questions. We could have recorded the answers we 
> received to our questions by writing "1" whenever the answer was "yes" and 
> "0" whenever it was "no." That record would then consist of eight so-called 
> bits each of which is either "1" or "0". This eight-bit string is then an 
> unambiguous representation of the character we are seeking. Moreover, each 
> character of the whole set has a unique eight-bit representation within the 
> same ordering. 
>

A neat description of encoding characters

2019-12-02 Thread Costello, Roger L. via Unicode

From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum,
p.74-75

Suppose that the alphabet with which we wish to concern ourselves consists of 
256 distinct symbols. Imagine that we have a deck of 256 cards, each of which 
has a distinct symbol of our alphabet printed on it, and, of course, such that 
there corresponds one card to each symbol. How many questions that can be 
answered "yes" or "no" would one have to ask, given one card randomly selected 
from the deck, in order to be able to decide which character is printed on the 
card? We can certainly make the decision by asking at most 256 questions. We 
can somehow order the symbols and begin by asking if it is the first in our 
ordering, e.g., "Is it an uppercase A?" If the answer is "no," then we ask if 
it is the second, and so on. But if our ordering is known both to ourselves and 
to our respondent, there is a much more economical way of organizing our 
questioning. We ask whether the character we are seeking is in the first half 
of the set. Whatever the answer, we will have isolated a set of 128
characters among which the character we seek resides. We again ask whether 
it is in the first half of that smaller set, and so on. Proceeding in this way, 
we are bound to discover what character is printed on the selected card by 
asking exactly eight questions. We could have recorded the answers we received 
to our questions by writing "1" whenever the answer was "yes" and "0" whenever 
it was "no." That record would then consist of eight so-called bits each of 
which is either "1" or "0". This eight-bit string is then an unambiguous 
representation of the character we are seeking. Moreover, each character of the 
whole set has a unique eight-bit representation within the same ordering.
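
The halving procedure in the passage can be sketched in a few lines of Python. This is only an illustration of the quoted description, assuming the 256 "symbols" are simply `chr(0)` to `chr(255)` in that order; the function name `identify` is made up for the example.

```python
def identify(ordering, secret):
    """Find `secret` by repeatedly asking "is it in the first half?".

    Returns the symbol found and the record of answers, "1" for yes
    and "0" for no, as in the quoted passage.
    """
    position = ordering.index(secret)  # known only to the respondent
    low, high = 0, len(ordering)       # candidates are ordering[low:high]
    record = ""
    while high - low > 1:
        mid = (low + high) // 2
        if position < mid:             # respondent answers "yes"
            record += "1"
            high = mid
        else:                          # respondent answers "no"
            record += "0"
            low = mid
    return ordering[low], record

ordering = [chr(i) for i in range(256)]
symbol, record = identify(ordering, "A")
print(symbol, record, len(record))
```

With "yes" recorded as 1 for the first half, the record is the bitwise complement of the symbol's position written in binary, which is one way to see that the eight-bit record is just a number in disguise.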

Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode

Le lun. 11 nov. 2019 à 17:31, Markus Scherer  a
écrit :

> We generally assign the script code when the script is in the pipeline for
> a near-future version of Unicode, which demonstrates that it's "a candidate
> for encoding". We also want the name of the script to be settled, so that
> the script code can be roughly mnemonic for the name.
>

This is not true for some scripts that have long had codes in ISO
15924, not all of which have a candidate proposal for encoding (notably
Tolkien's various invented scripts, Cirth, Tengwar, ... and Klingon, which
all have limited use and active supporters).

Other scripts were added even without a lot of evidence, or although they
are not even deciphered (Mayan hieroglyphs, Linear A...). There are also
missing scripts in India which are still in contemporary use and important
for the local cultures (but with limited support in specific states or
smaller communities at subregional level only), in Myanmar/Burma, and in
aboriginal communities of some southern Indonesian islands (I think there
are also some aboriginal logographic scripts in Australia, and other
pre-Columbian scripts in Central and South America, on very remote islands
in the Southern Pacific, and still in North-eastern Russia/Beringia).

Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode

Names of this script can vary a bit ("Nsibidi", "Nsibiri"), but not a lot
(the d/r variation may be phonetic romanization in one of the supported
languages). It is stable across various sites.

Uniqueness is quite easy to assert: there are not many ideographic
scripts, at least in modern use. It is still not as complex as the Chinese
script, though. The site speaks of an inventory of about 500 base characters
(in the first educational books), probably double that in total (in which
case it compares to the modern use of sinograms by children in China, whereas
adults use only about 2000 signs for almost everything; compare the similar
average of 2000 common words in Indo-European languages, and in Afroasiatic
or Nilo-Saharan languages. Igbo is still a minority language, most of
its speakers have a low level of literacy, even in the Latin or Arabic
scripts, and due to the proliferation of vernacular languages, they may well
use about 500-1000 basic words to understand each other).

Anyway, I suppose that you were already aware of this script, but were just
looking for more evidence, to allow some comparative research from a few
more sources (there is a lack of interest or finances for linguistic
projects in Africa, which prefer placing their efforts in major scripts that
have official national support in their educational and cultural programs:
Latin, Arabic, Ethiopic, Tifinagh; other scripts are still of interest due
to their important historic background and centuries of propagation across
countries through wars, invasions, diplomacy, or commercial interests).

Le lun. 11 nov. 2019 à 17:31, Markus Scherer  a
écrit :

> On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> But first there's still no code in ISO 15924 (first step easy to complete
>> before encoding in the UCS).
>>
>
> That's not first; it's nearly last.
>
> The script code standard says "In general, script codes shall be added to
> ISO 15924 when the script has been coded in ISO/IEC 10646, and when the
> script is agreed, by experts in ISO 15924/RA-JAC to be unique and a *candidate
> for encoding in the UCS*."
>
> We generally assign the script code when the script is in the pipeline for
> a near-future version of Unicode, which demonstrates that it's "a candidate
> for encoding". We also want the name of the script to be settled, so that
> the script code can be roughly mnemonic for the name.
>
> markus
>

Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Markus Scherer via Unicode

On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> But first there's still no code in ISO 15924 (first step easy to complete
> before encoding in the UCS).
>

That's not first; it's nearly last.

The script code standard says "In general, script codes shall be added to
ISO 15924 when the script has been coded in ISO/IEC 10646, and when the
script is agreed, by experts in ISO 15924/RA-JAC to be unique and a *candidate
for encoding in the UCS*."

We generally assign the script code when the script is in the pipeline for
a near-future version of Unicode, which demonstrates that it's "a candidate
for encoding". We also want the name of the script to be settled, so that
the script code can be roughly mnemonic for the name.

markus

Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode

Encoding the Nsibidi script (African) for writing the Efik, Ekoi, Ibibio
and Igbo languages.

See this site as an example of use, with links to published educational
books.
http://blog.nsibiri.org/
Also this online dictionary:
https://fr.scribd.com/doc/281219778/Ikpokwu

Other links:
https://en.wikipedia.org/wiki/Nsibidi

But first there's still no code in ISO 15924 (first step easy to complete
before encoding in the UCS).

Re: Encoding colour (from Re: Encoding italic)

2019-02-13 Thread Asmus Freytag via Unicode


  
  
On 2/13/2019 5:19 PM, Mark E. Shoulson via Unicode wrote:

> And again, all this is before we even consider other issues; I can't
> shake the feeling that there are security nightmares lurking inside
> this idea.

Default ignorables are bad juju.

A./

Re: Encoding colour (from Re: Encoding italic)

2019-02-13 Thread Mark E. Shoulson via Unicode


On 2/12/19 12:05 PM, Kent Karlsson via Unicode wrote:

> Den 2019-02-12 03:20, skrev "Mark E. Shoulson via Unicode":
>
>> On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:
>>> Continuing to look deep into the crystal ball, doing some more
>>> hand swirls...
>>>
>>> ...
>>>
>>> ...
>>>
>>> The scheme quoted (far) below (from wjgo_10009), or anything like it,
>>> will NEVER be part of Unicode!
>>
>> Not in Unicode, but I have to say I'm intrigued by the idea of writing
>> HTML with tag characters (not even necessarily "restricted" HTML: the
>> whole deal).  This does NOT make it possible to write "italics in plain
>> text," since you aren't writing plain text.  But what you can do is
>> write rich text (HTML) that Just So Happens to look like plain text when
>> rendered with a plain-text-renderer (and maybe there could be
>> plain-text-renderers that straddle the line, maybe supporting some
>> limited subset of HTML and doing boldface and italics or something).
>
> And so would ESC/command sequences as such, if properly skipped for display.
> If some are interpreted, those would affect the display of other characters.
> Just like "HTML in tag characters" would. A show invisibles mode would
> display both ESC/command sequences as well as "HTML in tag characters"
> characters.

Very true.  Maybe the explicitness of HTML appealed to me; escape 
sequences feel more like... you know, computer "codes" and all. (which 
of course is what all this is anyway!  So what's wrong with that?)

>> BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
>> be purely the decision of the committee in charge of HTML/XML and
>> related standards, to decide to accept Unicode tag characters as if they
>> were ASCII for the purposes of writing XML tags/attributes   It's
>
> I have no say on HTML/CSS, but I would venture to predict that those
> who do have a say, would not be keen on that idea. And XML tags in
> general need not be in ASCII. And... identifiers in CSS need not
> be in pure ASCII either... And attribute values, like filenames
> including those that refer to CSS files (CSS is preferably stored
> separately from the HTML/XML), certainly need not be pure ASCII.
>
> So, no, I'd say that that idea is completely dead.

You're probably right, and CSS is practically a different animal, and I 
guess at best one would have to settle for a stripped-down version of 
HTML (in which case, why bother?)  And again, all this is before we even 
consider other issues; I can't shake the feeling that there are security 
nightmares lurking inside this idea.


~mark

Re: Encoding colour (from Re: Encoding italic)

2019-02-13 Thread wjgo_10...@btinternet.com via Unicode


Philippe Verdy replied to my post, including quoting me.

WJGO >>  Thinking about this further, for this application copies of the 
glyphs could be redesigned so as to be square and could be emoji-style 
and the meanings of the characters specifying which colour component is 
to be set could be changed so that they refer to the number previously 
entered using one or more of the special  digit characters. Thus the 
setting of colour components could be done in the same reverse notation 
way that the FORTH computer language works.


PV > FORTH is not relevant to this discussion.

I just mentioned FORTH because of the way that numbers are entered 
before the operators that act upon them. I have no intention to use a 
stack-based system: what I have in mind at present is much simpler than 
such a format.


Suppose that there are sixteen new characters, which are in plane 1 or 
maybe plane 14, but which for this mailing list post I will express 
using the digits 0 .. 9, Z, R, G, B, A, F.


There would be a virtual machine to set the colour, that would have 
registers h, r, g, b, a and a system service 
Set_Foreground_Colour(r,g,b,a).


Then the sixteen new characters would each have a default glyph, which 
could be displayed emoji-style, and, in an application environment that 
has the virtual machine available and switched on, would have the 
following effects in the virtual machine and their glyphs would not then 
be displayed. The virtual machine would be sandboxed.


Z h:=0;
0 h:=10*h ;
1 h:=10*h + 1;
2 h:=10*h + 2;
3 h:=10*h + 3;
4 h:=10*h + 4;
5 h:=10*h + 5;
6 h:=10*h + 6;
7 h:=10*h + 7;
8 h:=10*h + 8;
9 h:=10*h + 9;
R r:=h; h:=0;
G g:=h; h:=0;
B b:=h; h:=0;
A a:=h; h:=0;
F Set_Foreground_Colour(r,g,b,a);

Thus for example, remembering that these ordinary characters are just 
being used here for explanation in this post, and that the actual 
characters if encoded would probably be in plane 1 or plane 14:


So the sequence Z128R160G248B255AF could be used to set the foreground 
colour to an opaque blue colour.
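
The table of effects above can be read as a very small interpreter. Here is a minimal Python sketch of it, using the ordinary stand-in characters from this post rather than any real code points; the function name `run_colour_program` is made up, and the system service Set_Foreground_Colour is replaced by simply collecting the (r, g, b, a) values it would receive.

```python
DIGITS = "0123456789"

def run_colour_program(program):
    """Interpret a command string; return every colour that 'F' would set."""
    h = r = g = b = a = 0
    colours_set = []
    for ch in program:
        if ch == "Z":
            h = 0
        elif ch in DIGITS:
            h = 10 * h + DIGITS.index(ch)   # accumulate a decimal number
        elif ch == "R":
            r, h = h, 0
        elif ch == "G":
            g, h = h, 0
        elif ch == "B":
            b, h = h, 0
        elif ch == "A":
            a, h = h, 0
        elif ch == "F":
            # Stand-in for Set_Foreground_Colour(r, g, b, a).
            colours_set.append((r, g, b, a))
        else:
            raise ValueError(f"not a colour command: {ch!r}")
    return colours_set

print(run_colour_program("Z128R160G248B255AF"))
```

The example sequence yields the single colour (128, 160, 248, 255). Note that the sketch does no range checking, so a value above 255 would pass through unchanged; a real virtual machine would need to say what happens then.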


It may be that, upon investigation, a feature of the system service 
Set_Foreground_Colour(r,g,b,a) could be specified such that "if a=0 then 
a:=255;" so that total opacity of the colour is presumed unless 
otherwise set.


PV > You may create your "proof of concept" (tested on limited 
configurations) but it will just be private


Yes.

PV > [And so it should use PUA for full compatibility ...

Yes, I have in mind to use U+EA60 through to U+EA69 for the digits, as 
U+EA60 is Alt 6 so it makes it easier if some of the people who want 
to experiment want to enter characters using the Alt method.
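
As a quick check of the proposed Private Use Area digits, here is a short Python sketch. It assumes only what is stated above, namely digits 0 to 9 at U+EA60 through U+EA69; the helper name `pua_digit` is hypothetical, and no code points are assumed for the other six characters.

```python
PUA_DIGIT_ZERO = 0xEA60  # proposed private-use code point for the digit 0

def pua_digit(value):
    """Return the private-use character proposed for a decimal digit 0..9."""
    if not 0 <= value <= 9:
        raise ValueError("value must be a single decimal digit")
    return chr(PUA_DIGIT_ZERO + value)

for value in range(10):
    ch = pua_digit(value)
    # Show the hexadecimal code point next to the same number in decimal.
    print(value, f"U+{ord(ch):04X}", ord(ch))
```

In decimal, U+EA60 is 60000 and U+EA69 is 60009, so the last decimal digit of the code point is the digit being represented, which is presumably what makes decimal Alt entry convenient here.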


William Overington
Monday 11 February 2019

Re: Vendor-assigned emoji (was: Encoding italic)

2019-02-13 Thread wjgo_10...@btinternet.com via Unicode


James Kass wrote:

> Nobody disagreed and I think it’s a splendid suggestion.  If anyone is
> discussing drafting a proposal to accomplish this, please include me
> in the “cc”.


I too would like to receive copies of any discussions please.

In relation to the proposal, I opine that the facility should not allow 
a glyph that has been assigned to be changed at a later date.


Given that discussion is about a whole plane of code points being 
assigned, then even if the code points are assigned at fifty every month 
that would take over one hundred years to fill a whole plane. Certainly 
early months might have more than fifty allocations.
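
The arithmetic behind that estimate can be checked in a few lines of Python. The sketch assumes a plane of 65,536 code points and ignores the two noncharacters at the end of every plane; the function name `years_to_fill` is made up for the example.

```python
PLANE_SIZE = 0x10000  # 65,536 code points in one plane

def years_to_fill(per_month, plane_size=PLANE_SIZE):
    """Years needed to assign a whole plane at a constant monthly rate."""
    months = plane_size / per_month
    return months / 12

print(round(years_to_fill(50), 1))  # fifty assignments every month
```

At fifty a month that is roughly 1311 months, or a little over 109 years.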


It is important to have stability as otherwise archived messages could 
have their meaning retrospectively changed with no easy way to find out 
the original meaning.


William Overington
Tuesday 12 February 2019

Re: Encoding italic

2019-02-12 Thread Kent Karlsson via Unicode



Oh, the crystal ball is pure solid state, no moving or hot parts.
A magic 8-ball on the other hand can easily get jammed...

(Now, enough of that...)

/K


Den 2019-02-12 02:57, skrev "James Kass via Unicode" :

> 
> On 2019-02-11 6:42 PM, Kent Karlsson wrote:
> 
>> Using a VS to get italics, or anything like that approach, will
>> NEVER be a part of Unicode!
> 
> Maybe the crystal ball is jammed.  This can happen, especially on the
> older models which use vacuum tubes.
> 
> Wanting a second opinion, I asked the magic 8 ball:
> “Will VS14 italic be part of Unicode?”
> The answer was:
> “It is decidedly so.”
>

Re: Encoding colour (from Re: Encoding italic)

2019-02-12 Thread Kent Karlsson via Unicode

Den 2019-02-12 03:20, skrev "Mark E. Shoulson via Unicode":

> On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:
>> Continuing to look deep into the crystal ball, doing some more
>> hand swirls...
>> 
>> ...
>> 
>> ...
>> 
>> The scheme quoted (far) below (from wjgo_10009), or anything like it,
>> will NEVER be part of Unicode!
> 
> Not in Unicode, but I have to say I'm intrigued by the idea of writing
> HTML with tag characters (not even necessarily "restricted" HTML: the
> whole deal).  This does NOT make it possible to write "italics in plain
> text," since you aren't writing plain text.  But what you can do is
> write rich text (HTML) that Just So Happens to look like plain text when
> rendered with a plain-text-renderer (and maybe there could be
> plain-text-renderers that straddle the line, maybe supporting some
> limited subset of HTML and doing boldface and italics or something).

And so would ESC/command sequences as such, if properly skipped for display.
If some are interpreted, those would affect the display of other characters.
Just like "HTML in tag characters" would. A show invisibles mode would
display both ESC/command sequences as well as "HTML in tag characters"
characters.

> BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
> be purely the decision of the committee in charge of HTML/XML and
> related standards, to decide to accept Unicode tag characters as if they
> were ASCII for the purposes of writing XML tags/attributes   It's

I have no say on HTML/CSS, but I would venture to predict that those
who do have a say, would not be keen on that idea. And XML tags in
general need not be in ASCII. And... identifiers in CSS need not
be in pure ASCII either... And attribute values, like filenames
including those that refer to CSS files (CSS is preferably stored
separately from the HTML/XML), certainly need not be pure ASCII.

So, no, I'd say that that idea is completely dead.

/Kent K

> totally nothing to do with Unicode, unless the XML folks want Unicode to
> change some properties on the tag chars or something.  I think it's a...
> fascinating idea, and probably has *disastrous* consequences lurking
> that I haven't tried to think of yet, but it's not a Unicode idea.
> 
> ~mark
>

Vendor-assigned emoji (was: Encoding italic)

2019-02-11 Thread James Kass via Unicode




On 2019-01-24 Andrew West wrote,

> The ESC and UTC do an appallingly bad job at regulating emoji, and I
> would like to see the Emoji Subcommittee disbanded, and decisions on
> new emoji taken away from the UTC, and handed over to a consortium or
> committee of vendors who would be given a dedicated vendor-use emoji
> plane to play with (kinda like a PUA plane with pre-assigned
> characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
> the vendors can then associate with glyphs as they see fit; and as
> emoji seem to evolve over time they would be free to modify and
> reassign glyphs as they like because the Unicode Standard would not
> define the meaning or glyph for any characters in this plane).

Nobody disagreed and I think it’s a splendid suggestion.  If anyone is 
discussing drafting a proposal to accomplish this, please include me in 
the “cc”.

Re: Encoding italic

2019-02-11 Thread James Kass via Unicode




Philippe Verdy wrote,

>>> case mappings,
>>
>> Adjust them as needed.
>
> Not so easy: case mappings cannot be fixed. They are stabilized in
> Unicode.

> You would need special casing rules under a specific "locale" for maths.

In BabelPad, I can select a string of text and convert it to math 
italics.  If upper case italics is desired, it would be necessary to 
select the text, convert it back to ASCII, convert it to upper case, and 
convert that upper case to math italics.  Casing the math alphanumerics 
doesn’t seem to present any problem.  Any program could make those 
interim steps invisible to the end user.
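
Those interim steps can be sketched in Python as follows. This is not how BabelPad does it, just an illustration under two assumptions: only the letters A-Z and a-z are converted, and NFKC normalization is used for the "convert it back to ASCII" step (which would also fold any other compatibility characters in the text). The function names are made up for the example.

```python
import unicodedata

MATH_ITALIC_CAPITAL_A = 0x1D434
MATH_ITALIC_SMALL_A = 0x1D44E
PLANCK_CONSTANT = "\u210E"  # serves as italic small h; U+1D455 is reserved

def to_math_italic(text):
    """Map ASCII letters to mathematical italic letters; leave the rest alone."""
    out = []
    for ch in text:
        if ch == "h":
            out.append(PLANCK_CONSTANT)
        elif "A" <= ch <= "Z":
            out.append(chr(MATH_ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(MATH_ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

def math_italic_upper(text):
    """Upper-case math italic text via the ASCII round trip described above."""
    ascii_text = unicodedata.normalize("NFKC", text)  # back to plain letters
    return to_math_italic(ascii_text.upper())

lower = to_math_italic("hello")
print(math_italic_upper(lower) == to_math_italic("HELLO"))
```

The round trip is invisible to the caller, which is the point being made: the case mapping is done on the ASCII letters, not on the math alphanumerics themselves.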


(With VS14, BabelTags mark-up, or new control character(s)—casing isn’t 
even an issue.)

Re: Encoding colour (from Re: Encoding italic)

2019-02-11 Thread Mark E. Shoulson via Unicode


On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:

Continuing to look deep into the crystal ball, doing some more
hand swirls...

...

...

The scheme quoted (far) below (from wjgo_10009), or anything like it,
will NEVER be part of Unicode!


Not in Unicode, but I have to say I'm intrigued by the idea of writing 
HTML with tag characters (not even necessarily "restricted" HTML: the 
whole deal).  This does NOT make it possible to write "italics in plain 
text," since you aren't writing plain text.  But what you can do is 
write rich text (HTML) that Just So Happens to look like plain text when 
rendered with a plain-text-renderer (and maybe there could be
plain-text-renderers that straddle the line, maybe supporting some
limited subset of HTML and doing boldface and italics or something).
BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
be purely the decision of the committee in charge of HTML/XML and
related standards, to decide to accept Unicode tag characters as if they
were ASCII for the purposes of writing XML tags/attributes.  It's
totally nothing to do with Unicode, unless the XML folks want Unicode to 
change some properties on the tag chars or something.  I think it's a... 
fascinating idea, and probably has *disastrous* consequences lurking 
that I haven't tried to think of yet, but it's not a Unicode idea.
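
To make the idea concrete, here is a rough sketch of what "HTML written in tag characters" could look like at the code point level. It relies only on the fact that the tag characters U+E0020..U+E007E mirror ASCII U+0020..U+007E at a fixed offset; the function names are made up for illustration, and no HTML or XML standard accepts this today.

```python
TAG_OFFSET = 0xE0000  # tag characters U+E0020..U+E007E mirror ASCII U+0020..U+007E


def to_tag_characters(ascii_text):
    """Re-encode printable ASCII as the corresponding Unicode tag characters."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(TAG_OFFSET + cp))
    return "".join(out)


def is_tag_character(ch):
    return 0xE0020 <= ord(ch) <= 0xE007E


def reveal_markup(text):
    """What a hypothetical tag-aware renderer would see: tags turned back into ASCII."""
    return "".join(chr(ord(ch) - TAG_OFFSET) if is_tag_character(ch) else ch
                   for ch in text)


def strip_markup(text):
    """What a plain-text renderer effectively shows if it ignores tag characters."""
    return "".join(ch for ch in text if not is_tag_character(ch))


message = ("This is " + to_tag_characters("<i>") + "important"
           + to_tag_characters("</i>") + ".")
```

Here strip_markup(message) gives "This is important." while reveal_markup(message) gives "This is <i>important</i>.", which is the straddling behaviour described above.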


~mark

Re: Encoding italic

2019-02-11 Thread James Kass via Unicode

On 2019-02-11 6:42 PM, Kent Karlsson wrote:

> Using a VS to get italics, or anything like that approach, will
> NEVER be a part of Unicode!

Maybe the crystal ball is jammed.  This can happen, especially on the 
older models which use vacuum tubes.

Wanting a second opinion, I asked the magic 8 ball:
“Will VS14 italic be part of Unicode?”
The answer was:
“It is decidedly so.”

Re: Encoding italic

2019-02-11 Thread Kent Karlsson via Unicode

On 2019-02-11 10:55, "wjgo_10...@btinternet.com via Unicode" wrote:

> Doug Ewell wrote:
> 
>> , just as next to nobody is using the proposed VS14 mechanism 
> 
> Well, of course not because use of VS14 in a plain text document to
> record a request for an italic glyph version is not at the present time
> an official part of Unicode.

Looking deeply into the crystal ball, swirling my hands over it...

...

...

Using a VS to get italics, or anything like that approach, will
NEVER be a part of Unicode!

/Kent K

Re: Encoding italic

2019-02-11 Thread wjgo_10...@btinternet.com via Unicode


Doug Ewell wrote:


…, just as next to nobody is using the proposed VS14 mechanism …


Well, of course not because use of VS14 in a plain text document to 
record a request for an italic glyph version is not at the present time 
an official part of Unicode. The next scheduled Unicode Technical 
Committee meeting is due to start on 30 April 2019.


Here is a link to the proposal document.

https://www.unicode.org/L2/L2019/19063-italic-vs.pdf

VS14 is used to indicate a request for an italic glyph version in my 
VS14 Maquette font but that is clearly just a maquette font for 
experimental use to test the concept and show that it works. An 
application program that supports OpenType and that has the liga table 
switched on is needed in order to use the VS14 Maquette font to 
demonstrate that the use of VS14 in this way works.


https://forum.high-logic.com/viewtopic.php?f=10=7831

William Overington

Monday 11 February 2019

Re: Encoding colour (from Re: Encoding italic)

2019-02-11 Thread Philippe Verdy via Unicode

On Sun, 10 Feb 2019 at 02:33, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> wrote:

> Previously I wrote:
>
> > A stateful method, though which might be useful for plain text streams
> > in some applications, would be to encode as characters some of the
> > glyphs for indicating colours and the digit characters to go with them
> > from page 5 and from page 3 of the following publication.
>
> > http://www.users.globalnet.co.uk/~ngo/locse027.pdf
>
> Thinking about this further, for this application copies of the glyphs
> could be redesigned so as to be square and could be emoji-style and the
> meanings of the characters specifying which colour component is to be
> set could be changed so that they refer to the number previously entered
> using one or more  of the special  digit characters. Thus the setting of
> colour components could be done in the same reverse notation way that
> the FORTH computer language works.
>

FORTH is not relevant to this discussion. Anyway, the usual order for Forth
operators (Forth is a stack-based language, similar to PostScript, and
works like calculators using reverse Polish notation) is to push the
operands from left to right and then apply the operator, which pops them
in reverse order, from right to left, before pushing the result on the stack
(so "a/b/c" becomes "/a get /b get div /c get div"). But colors are just an
operator like "rgb(r,g,b)", and the natural order in stack-based languages
should also be "/r get /g get /b get rgb".
Note that C/C++ (with C calling conventions) usually use another order for
the stack, pushing parameters from right to left (unless they are passed
via dedicated registers in a fixed order, the first parameter from the right
that fits a register being passed not on the stack but in the "main"
accumulator register, possibly a pair of registers for long integers or long
pointers, or a different register for floating-point values if floating-point
registers are used).
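
The push-left-to-right, pop-right-to-left behaviour can be sketched with a tiny postfix evaluator. This is an illustration only: the token names "div" and "rgb" follow the PostScript-like examples above, and the evaluator itself is hypothetical, not Forth or PostScript.

```python
def eval_postfix(tokens, env):
    """Evaluate a postfix token list; operands are pushed in reading order."""
    stack = []
    for tok in tokens:
        if tok == "div":
            right = stack.pop()  # popped first: the operand pushed last
            left = stack.pop()
            stack.append(left / right)
        elif tok == "rgb":
            b = stack.pop()      # components come off in reverse order
            g = stack.pop()
            r = stack.pop()
            stack.append((r, g, b))
        elif tok in env:
            stack.append(env[tok])   # named operand
        else:
            stack.append(float(tok)) # numeric literal
    return stack


# a/b/c written in postfix order: a b div c div
quotient = eval_postfix("a b div c div".split(), {"a": 24, "b": 4, "c": 3})
# r g b pushed left to right, then the operator
colour = eval_postfix("255 128 0 rgb".split(), {})
```

With these inputs, quotient is [2.0] and colour is [(255.0, 128.0, 0.0)]: the operator always follows its operands, and the operands keep their written order.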

There's no standard for the order of parameters in stack based languages.
It is arbitrary and specific to each language or specific implementations
of them. So if you want to create your own scripting language to support
your non-standard extension, you can choose any order you want, but this
will still not define a standard related to other languages that have never
been bound to a specific evaluation/encoding order. Then don't pretend it
will be part of the Unicode standard, which is not a scripting language and
that does not offer an "ABI" for stateful encodings with arbitrarily long
contexts (Unicode has placed very low limits on the maximum length of
lookahead needed to process text; your extension would not work under these
reasonable limits, so it will have limited private use and cannot be part
of TUS).

You may create your "proof of concept" (tested on limited configurations)
but it will just be private

[And so it should use PUA for full compatibility and not abuse the other
standardized code points, as your extension would not be
compatible/conforming to the existing rules and limits, without amending
them and discussing a lot how existing conforming applications can be
adapted, and analyzing the effects if they are not updated. Approving this
extension is another thing, and it will need to pass the standard process
to be added to the proposals schedule, pass through the two technical
committees, pass the alpha and beta phases, and then the prepublication.
You'll also need to work on documentations and fix many quirks found in
them, then you'll need supporters to pass the vote (and if you're not a
UTC member or an ISO member, you will never be able to vote for it: you
need then to convince the voters by listening to what they remark and refining
your specifications to match their desires, and probably to split your
proposal in several parts or limit your initial goals, leaving the other
problematic points for later; if what remains "stable" in your proposal may
not be usable in practice without the additional extensions still in
discussion, and in fact this subset may still remain in the encoding queue
for years, until it reaches a point where it starts being usable for
practical problems; before that, you'll have to experiment with private-use
and should be ready to accept competing proposals, not compatible with your
proposal, and learn from them to reach an acceptable consensus; reaching
that consensus is the longest step but initially most voters will not
decide for or against your proposal if they are not confident enough about
the merit of each proposal, because they want to preserve a reasonable
compatibility across TUS versions and with existing applications without
adding further problems, notably in terms of confusability/security. But
don't ask them to break the existing stability rules which were eve

Re: Encoding italic

2019-02-11 Thread Philippe Verdy via Unicode

On Sun, 10 Feb 2019 at 16:42, James Kass via Unicode wrote:

>
> Philippe Verdy wrote,
>
>  >> ...[one font file having both italic and roman]...
>  > The only case where it happens in real fonts is for the mapping of
>  > Mathematical Symbols which have a distinct encoding for some
>  > variants ...
>
> William Overington made a proof-of-concept font using the VS14 character
> to access the italic glyphs which were, of course, in the same real
> font.  Which means that the developer of a font such as Deja Vu Math TeX
> Gyre could set up an OpenType table mapping the Basic Latin in the font
> to the italic math letter glyphs in the same font using the VS14
> characters.  Such a font would work interoperably on modern systems.
> Such a font would display italic letters both if encoded as math
> alphanumerics or if encoded as ASCII plus VS14.  Significantly, the
> display would be identical.
>
>  > ...[math alphanumerics]...
>  > These were allowed in Unicode because of their specific contextual
>  > use as distinctive symbols from known standards, and not for general
>  > use in human languages
>
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX.  They remain Latin letter form
> variants.  If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era.  Arguing that they must not be
> used other than for scientific purposes is just so much semantic
> quibbling in order to justify their encoding.
>
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or
> 𝕌+𝟚𝟘𝟞𝟘?  Hexadecimal notation is certainly math and Unicode can be
> considered a science.  Would that be “math abuse” if we did it?  (Is
> linguistics not a science?)
>
>  > (because these encodings are defective and don't have the necessary
>  > coverage, notably for the many diacritics,
>
> The combining diacritics would be used.
>
Not for the many precombined characters that are in Latin: do you intend to
propose them to be reencoded with all the same variants encoded for maths?
Or allow the maths symbols to have diacritics added on them (hint: this
does not work correctly with the specific mathematical conventions on
diacritics and their specific stacking rules: they are NOT reorderable
through canonical equivalence, the order is significant in maths, so you
would also need to use CGJ to fix the expected logical semantic and visual
stacking order).

>
>  > case mappings,
>
> Adjust them as needed.
>

Not so easy: case mappings cannot be fixed. They are stabilized in Unicode.
You would need special casing rules under a specific "locale" for maths.

Really maths is a specific script even if it borrows some symbols from
Latin, Greek or Hebrew but only in specific glyph variants. These symbols
should not be even considered as part of the script they originate from
(just like Latin A is not the same as Cyrillic A or Greek Alpha, that all
have the same forms and the same origin).

I can argue the same thing about IPA notations: they are NOT the Latin
script and also borrow some letter forms from Latin and Greek, but without
any case mappings (only lowercase is used), and also with specific glyph
variants.

Both examples are technical notations which do not obey the linguistic
rules and normal processing of the script they originate from. They are
specific "writing systems", unfortunately confused within "Unicode
scripts", and then abused.

Note that some Latin letters have been borrowed from IPA too, for use in
African languages, then case mappings were needed: these should have been
reencoded as a plain letter pair with a basic case mapping (not the special
case mapping rules now needed for African languages, such as open o which
looks much like the mirrored c from Latin Roman digits, and open e which
was borrowed from Greek epsilon in lowercase but does not use the uppercase
Greek Epsilon and uses instead another shape, meaning that the Latin open e
should have been encoded as a plain letter pair, distinct from the Greek
epsilon; but IPA already used the epsilon-like symbol...).

In the end these exceptions just cause many inconsistencies and complexities.
Applications and libraries cannot adapt easily and are not downward
compatible because stable properties are immutable and specific tailorings
are needed each time in applications: the more we add these exceptions, the
less the standard is easy to adapt and compatibility is much more difficult
to preserve. In summary I don't like at all the dual encodings or encodings
of additional letters that cannot use the normal stable properties (and
this remark is also true for emojis: what a mess ! full of exceptions and
different incoherent encoding models !)

Re: Encoding italic

2019-02-10 Thread Kent Karlsson via Unicode





On 2019-02-10 16:31, "James Kass via Unicode" wrote:

> 
> Philippe Verdy wrote,
> 
>>> ...[one font file having both italic and roman]...

For OpenType fonts, there is a "design axis" called "ital". Value 0 on that
axis would be roman (upright, normally), and value 1 on that axis would be
italic. I don't know to what extent that is available in OpenType fonts in
common use... (Instead of using two separate font files.)

[math chars]
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX. 

They were basically requested "by" STIX, yes. Not sure about the
round-tripping bit.

> They remain Latin letter form
> variants.  If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era.  Arguing that they must not be
> used other than for scientific purposes

I don't think that particular argument was made, IIUC.

> is just so much semantic
> quibbling in order to justify their encoding.
> 
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or
> 𝕌+𝟚𝟘𝟞𝟘?

That particular example would be ok (even though outside of a
conventional math formula). But we were talking about natural
languages in their conventional orthography, using italics/bold.

/Kent K

Re: Encoding italic

2019-02-10 Thread Doug Ewell via Unicode

Egmont Koblinger wrote:

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.

As others have pointed out, I am suggesting the use of some profile of ISO 6429 
within plain text to implement these features about which there is disagreement 
whether they belong in plain text or not.

I am very definitely NOT proposing that anything be added to Unicode or 10646, 
nor that an all-new standard be created.

> There is not a well-defined framework for escape sequences.

I thought ISO 6429 defined things rather clearly, if verbosely.

> In this particular case you might say it starts with ESC [ and ends
> with the letter 'm', but how do you know where to end the sequence if
> that letter 'm' just doesn't arrive?

Well, what do you do in HTML if the closing '>' never arrives?

If it's simply a matter of the text coming to an end before the 'm' arrives, 
then it doesn't matter. If the 'm' (or other final code unit for other 
commands) is dropped but the sequence goes on, like ESC[3This is
italicizedESC[m, then gosh, I don't know offhand what the standard says. It
might be worthwhile to try looking it up, or seeing what implementations do, or 
defining it clearly in the profile.

> Terminal emulators have extremely complex tables for parsing (and
> still many of them get plenty of things wrong). It's unreasonable for
> any random small utility processing Unicode text to go into this
> business of recognizing all the well-known escape sequences, not even
> to the extent to know where they end.

Perhaps interestingly, I wrote a random small utility many years ago that 
displayed ISO 6429 sequences on a Windows console, back in the dark ages 
between ANSI.SYS and Windows 10 support for 6429. It didn't cover the entire 
standard, nor could it, but a decent subset. It understood where sequences 
ended, even unknown ones, because that is all laid out in the standard.
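
A minimal sketch of how a small utility can find where a control sequence ends without knowing what it means, assuming the ECMA-48 byte classes: after CSI come parameter bytes (0x30 to 0x3F), then intermediate bytes (0x20 to 0x2F), then one final byte (0x40 to 0x7E). The function name and the handling of truncated sequences are this sketch's own choices, not something taken from the standard or from the utility mentioned above.

```python
ESC = "\x1b"
CSI_C1 = "\x9b"  # single-character CSI, equivalent to ESC [


def split_control_sequences(text):
    """Split text into ('text', s) and ('csi', params, intermediates, final) items."""
    items, plain = [], []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == CSI_C1:
            start = i + 1
        elif ch == ESC and i + 1 < n and text[i + 1] == "[":
            start = i + 2
        else:
            plain.append(ch)
            i += 1
            continue
        j = start
        while j < n and "\x30" <= text[j] <= "\x3f":   # parameter bytes
            j += 1
        k = j
        while k < n and "\x20" <= text[k] <= "\x2f":   # intermediate bytes
            k += 1
        if k < n and "\x40" <= text[k] <= "\x7e":      # final byte ends the sequence
            if plain:
                items.append(("text", "".join(plain)))
                plain = []
            items.append(("csi", text[start:j], text[j:k], text[k]))
            i = k + 1
        else:
            # Truncated or malformed: this sketch keeps the introducer as text.
            plain.append(text[i:start])
            i = start
    if plain:
        items.append(("text", "".join(plain)))
    return items
```

Under these rules the dropped-'m' case is still delimited: in ESC [ 3 T h i s, the 'T' falls in the final-byte range, so the sketch reports a sequence ending in 'T' followed by the text "his". Whether that is the desired behaviour is exactly what a profile would need to spell out.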

> Whatever is designed should be much more easily parseable. Should you
> say "everything from ESC[ to m", you'll cause a whole bunch of
> problems when a different kind of escape sequence gets interpreted as
> Unicode.

I'm afraid I don't understand this statement.

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each.

That's easy:

3 = turn on italics
0 = turn off all special styling, including italics
1 = turn on bold (or intense, whichever the output device supports)

It's a silly sequence, because why would you turn on an attribute and then 
immediately turn it off before using it? But silly though it may be, it's 
well-formed and very easy to parse. My random small utility had no problem with 
it.
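
For illustration, the parameter string of such a sequence can be applied one field at a time. The sketch below assumes a small subset of the SGR codes (0, 1, 3, 7 and their counterparts 22, 23, 27); the function name and the set-of-strings style model are made up for this example.

```python
def apply_sgr(params, style=None):
    """Apply a semicolon-separated SGR parameter string to a set of style names."""
    style = set(style or ())
    for field in params.split(";"):
        code = int(field) if field else 0  # an empty parameter defaults to 0
        if code == 0:
            style.clear()                  # turn off all special styling
        elif code == 1:
            style.add("bold")              # bold or increased intensity
        elif code == 3:
            style.add("italic")
        elif code == 7:
            style.add("reverse")
        elif code == 22:
            style.discard("bold")
        elif code == 23:
            style.discard("italic")
        elif code == 27:
            style.discard("reverse")
        # any other code is ignored in this sketch
    return style
```

Applying "3;0;1" to an empty style leaves only bold set: italics goes on, everything is reset, then bold goes on.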

> Also, it should be carefully evaluated what to do with C1 (U+009B)
> instead of the C0 ESC[ opening for an escape sequence – here terminal
> emulators vary. These just make everything even more cumbersome.

Why would they vary? CSI encoded as <1B 5B> or as <9B> is exactly the same. 
Again, this is very clear in the standard.

> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".
> It's only nowadays that most terminal emulators support 256 colors and
> some even support 16M true colors that some emulators try to push for
> this bit unambiguously meaning "bold" only, whereas in most emulators
> it means "both bold and increased intensity". [...]

Why would we expect every displayed and printed page to look identical? That's 
not going to happen no matter what encoding mechanism you use for "bold" and 
"intense" and the rest. Not all HTML pages look identical either.

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette?

Why not?

> Should Unicode go into the business

Nope. Unicode should do nothing about this.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.
> ECMA-48 doesn't say anything about it, ITU T.416 does, although it's
> absolutely not clear. See e.g. the discussion at the comment section
> of https://gist.github.com/XVilka/8346728 , in Dec 2018, we just
> couldn't figure out which syntax exactly ITU T.416 wants to say.

That sounds like someone should send a question to ITU-T. Exegesis would
perhaps be more productive than despair.

> Moreover, due to a common misinterpretation of the spec, one of the
> positional parameters is often omitted.

That's a decision designers and implementers are sometimes faced with: should 
we remain bug-compatible with other implementations, or follow the straight and 
narrow path? I remember browsers

Re: Encoding italic

2019-02-10 Thread James Kass via Unicode




Philippe Verdy wrote,

>> ...[one font file having both italic and roman]...
> The only case where it happens in real fonts is for the mapping of
> Mathematical Symbols which have a distinct encoding for some
> variants ...

William Overington made a proof-of-concept font using the VS14 character 
to access the italic glyphs which were, of course, in the same real 
font.  Which means that the developer of a font such as Deja Vu Math TeX 
Gyre could set up an OpenType table mapping the Basic Latin in the font 
to the italic math letter glyphs in the same font using the VS14 
characters.  Such a font would work interoperably on modern systems.  
Such a font would display italic letters both if encoded as math 
alphanumerics or if encoded as ASCII plus VS14.  Significantly, the 
display would be identical.


> ...[math alphanumerics]...
> These were allowed in Unicode because of their specific contextual
> use as distinctive symbols from known standards, and not for general
> use in human languages

They were encoded for interoperability and round-tripping because they 
existed in character sets such as STIX.  They remain Latin letter form 
variants.  If they had been encoded as the variant forms which 
constitute their essential identity it would have broken the character 
vs. glyph encoding model of that era.  Arguing that they must not be 
used other than for scientific purposes is just so much semantic 
quibbling in order to justify their encoding.


Suppose we started using the double struck ASCII variants on this list 
in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or
𝕌+𝟚𝟘𝟞𝟘?  Hexadecimal notation is certainly math and Unicode can be
considered a science.  Would that be “math abuse” if we did it?  (Is 
linguistics not a science?)


> (because these encodings are defective and don't have the necessary
> coverage, notably for the many diacritics,

The combining diacritics would be used.

> case mappings,

Adjust them as needed.

> and other linguistic, segmentation and layout properties).
>
> The same can be said about superscript/subscript variants,
> ... : they have specific use and not made for general purpose texts ...

So people who used ISO-8859-1 were not allowed to use the superscript 
digits therein for marking footnotes?  Those superscript digits were 
reserved by ISO-8859-1 only for use by math and science?


MATHEMATICAL ITALIC CAPITAL A
Decomposition mapping:  U+0041
Binary properties:  Math, Alphabetic, Uppercase, Grapheme Base, ...

SUPERSCRIPT TWO
Decomposition mapping:  U+0032
Binary properties:  Grapheme Base

MODIFIER LETTER SMALL C
Decomposition mapping:  U+0063
Binary properties:  Alphabetic, Lowercase, Grapheme Base, ...
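
The decomposition mappings listed above can be checked with Python's standard unicodedata module. This is just a sketch; the helper name describe is made up, and the results depend on the Unicode version bundled with the interpreter (all three characters have been encoded for a long time).

```python
import unicodedata


def describe(name):
    """Return (code point, decomposition field, NFKC form) for a character name."""
    ch = unicodedata.lookup(name)
    return (ord(ch),
            unicodedata.decomposition(ch),
            unicodedata.normalize("NFKC", ch))


for name in ("MATHEMATICAL ITALIC CAPITAL A",
             "SUPERSCRIPT TWO",
             "MODIFIER LETTER SMALL C"):
    cp, decomp, folded = describe(name)
    print(f"U+{cp:04X} {name}: {decomp} -> {folded}")
```

All three are compatibility decompositions (tagged font or super), so they fold to the plain letter or digit under NFKC but not under NFC.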

Re: Encoding italic

2019-02-10 Thread Philippe Verdy via Unicode

On Sun, 10 Feb 2019 at 05:34, James Kass via Unicode wrote:

>
> Martin J. Dürst wrote,
>
>  >> Isn't that already the case if one uses variation sequences to choose
>  >> between Chinese and Japanese glyphs?
>  >
>  > Well, not necessarily. There's nothing prohibiting a font that includes
>  > both Chinese and Japanese glyph variants.
>
> Just as there’s nothing prohibiting a single font file from including
> both roman and italic variants of Latin characters.
>

Maybe, but such a font would not work as intended to display both styles
distinctly with the common use of the italic style: it would have to make a
default choice and you would then need either a special text encoding, or
enabling an OpenType feature (if using OpenType font format) to select the
other style in a non-standard custom way.

The only case where it happens in real fonts is for the mapping of
Mathematical Symbols which have a distinct encoding for some variants (only
for a basic subset of the Latin alphabet, as well as some basic Greek and a
few other letters from other scripts), and this is typically done only in
symbol fonts containing other mathematical symbols, but because of the
specific encoding for such mathematical use. As well we have the variants
registered in Unicode for IPA usage (only lowercase letters, treated as
symbols and not case-paired).

These were allowed in Unicode because of their specific contextual use as
distinctive symbols from known standards, and not for general use in human
languages (because these encodings are defective and don't have the
necessary coverage, notably for the many diacritics, case mappings, and
other linguistic, segmentation and layout properties).

The same can be said about superscript/subscript variants, bold variants,
monospace variants: they have specific use and not made for general purpose
texts in human languages with their common orthographic conventions: Latin
is a large script and one of the most complex, and it's quite normal that
there are some deviating usages for specific purposes, provided they are
bounded in scope and use.

But what you would like is to extend the whole Latin script (and why not
Greek, Cyrillic, and others) with multiple reencodings for lot of stylistic
variants, and each time a new character or diacritic is encoded it would
have to be encoded multiple times (so you'd break the encoding character
model, and would just complicate the implementation even more, and would
also create new security issues with lot of new confusables, that every
user of Unicode would then have to take into account, and every application
or library would then need to be updated, and have to include large
datatables to handle them).

As well it would create many conflicts if we used the "VARIATION SELECTOR
n" characters, or would need to permanently assign specific ones for
specific styles; and then rapidly we would no longer have enough "VARIATION
SELECTOR n" selectors in Unicode : we only have 256 of them, only one is
more or less permanently dedicated.
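
For reference, the 256 selectors sit in two ranges: VARIATION SELECTOR-1 to -16 at U+FE00..U+FE0F and VARIATION SELECTOR-17 to -256 at U+E0100..U+E01EF. A small sketch of that numbering follows; the function name is made up, and appending VS14 to a letter, as in the last line, is only the proposed usage discussed in this thread, not a registered variation sequence.

```python
import unicodedata


def variation_selector(n):
    """Return the character VARIATION SELECTOR-n for 1 <= n <= 256."""
    if 1 <= n <= 16:
        return chr(0xFE00 + n - 1)
    if 17 <= n <= 256:
        return chr(0xE0100 + n - 17)
    raise ValueError("variation selectors are numbered 1 to 256")


VS14 = variation_selector(14)   # U+FE0D
VS16 = variation_selector(16)   # U+FE0F, used for emoji presentation

# Proposed (not standardized) usage: each letter followed by VS14.
requested_italic = "".join(ch + VS14 for ch in "italic")
```

Note that every styled letter costs two code points under such a scheme, and a renderer without support would normally just ignore the selectors and show the plain letters.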

[VS16 is almost completely reserved now for distinction between
normal/linguistic and emoji/colorful variants. The emoji subset in Unicode
is an open set which could expand in the future to tens of thousands of
symbols, and will likely cause large work overhead in the CLDR project just to
describe them, one reason for which I think that Emoji character data in
CLDR should be separated in a distinct translation project, with its own
versioning and milestones, and not maintained in sync with the rest of CLDR
data, if we consider how emojis have flooded the CLDR survey discussions,
when this subset has many known issues and inconsistencies and still no
viable encoding model like the "character encoding model" to make it more
consistent, and updatable separately from the rest of the Unicode UCD
releases; in my opinion the emojis in Unicode are still an alpha project in
development and it's too soon to describe them as a "standard" when there
are many other possible ways to handle them; these emojis are just there
now to remain as "legacy" mappings but won't resist an expected coming new
formal standard about them instead of the current mess they create now.]

Re: Encoding italic

2019-02-10 Thread Rebecca Bettencourt via Unicode

On Sat, Feb 9, 2019 at 6:23 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sat, 9 Feb 2019 04:52:30 -0800
> David Starner via Unicode  wrote:
>
> > Note that this is actually the only thing that stands out to me in
> > Unicode not supporting older character sets; in PETSCII (Commodore
> > 64), the high-bit characters were the reverse (in this
> > sense) of the low-bit characters.
>
> Later ISCII has some styling codes, bold and italic amongst them.
>

Interesting.

I found the 1991 ISCII spec: http://varamozhi.sourceforge.net/iscii91.pdf

The styling codes are:

EF 30 - Bold
EF 31 - Italic
EF 32 - Underline
EF 33 - Double Width
EF 34 - Highlight
EF 35 - Outline
EF 36 - Shadow
EF 37 - Double Height, Top Half
EF 38 - Double Height, Bottom Half
EF 39 - Double Height & Double Width

There are also codes for switching scripts (Roman, Devanagari, Bengali,
Tamil, Arabic, Persian, etc.) but these are not necessary since Unicode
encodes these separately.

These take effect "till the end of a line, or till the same attribute [code
is encountered]." In other words, these just toggle the attribute, and all
the attributes are reset when a newline is encountered.
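
A rough sketch of that toggle-and-reset behaviour, using the byte pairs from the list above. It assumes the attribute introducer is the byte 0xEF followed by one code byte, and that a line ends at byte 0x0A; both the function name and the newline choice are assumptions of this sketch, and the script-switching codes are not handled.

```python
ATR = 0xEF          # introducer byte for the styling codes listed above
NEWLINE = 0x0A      # assumption: line feed ends the line
STYLES = {
    0x30: "bold",
    0x31: "italic",
    0x32: "underline",
    0x33: "double width",
    0x34: "highlight",
    0x35: "outline",
    0x36: "shadow",
    0x37: "double height, top half",
    0x38: "double height, bottom half",
    0x39: "double height & double width",
}


def styled_bytes(data):
    """Pair each content byte with the set of attributes active when it appears."""
    active = set()
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ATR and i + 1 < len(data) and data[i + 1] in STYLES:
            active ^= {STYLES[data[i + 1]]}  # same code again toggles it off
            i += 2
            continue
        if b == NEWLINE:
            active.clear()                   # all attributes reset at end of line
        out.append((b, frozenset(active)))
        i += 1
    return out
```

For example, EF 30 41 EF 30 42 yields byte 0x41 in bold and byte 0x42 with no attributes, and an italic run started before a newline does not carry over to the next line.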

Re: Encoding italic

2019-02-09 Thread James Kass via Unicode




Martin J. Dürst wrote,

>> Isn't that already the case if one uses variation sequences to choose
>> between Chinese and Japanese glyphs?
>
> Well, not necessarily. There's nothing prohibiting a font that includes
> both Chinese and Japanese glyph variants.

Just as there’s nothing prohibiting a single font file from including 
both roman and italic variants of Latin characters.

Re: Encoding italic

2019-02-09 Thread Martin J . Dürst via Unicode

On 2019/02/09 19:58, Richard Wordingham via Unicode wrote:
> On Fri, 8 Feb 2019 18:08:34 -0800
> Asmus Freytag via Unicode  wrote:

>> Under the implicit assumptions bandied about here, the VS approach
>> thus reveals itself as a true rich-text solution (font switching)
>> albeit realized with pseudo coding rather than markup, markdown or
>> escape sequences.
> 
> Isn't that already the case if one uses variation sequences to choose
> between Chinese and Japanese glyphs?

Well, not necessarily. There's nothing prohibiting a font that includes 
both Chinese and Japanese glyph variants.

Regards,   Martin.

Encoding colour (from Re: Encoding italic)

2019-02-09 Thread wjgo_10...@btinternet.com via Unicode


Egmont Koblinger wrote:


Should this scheme be extended for colors, too? What to do with the
legacy 8/16 as well as the 256-color extensions wrt. the color
palette? Should Unicode go into the business of defining a fixed set
of colors, or allow to alter the palette colors using the OSC 4 and
friends escape sequences which are supported by about half of the terminal
emulators out there?

Encoding colour is already a topic in relation to emoji and maybe could 
be extended to other characters.


A stateful method, though which might be useful for plain text streams 
in some applications, would be to encode as characters some of the 
glyphs for indicating colours and the digit characters to go with them 
from page 5 and from page 3 of the following publication.


http://www.users.globalnet.co.uk/~ngo/locse027.pdf

What to do with things that Unicode might also want to have, but
don't exist in terminal emulators due to their nature, such as
switching to a different font size?

Well, if people were to want to do it, there could be a character 
encoded in the Specials section and then use that character as a base 
character and follow it with a sequence of tag characters.


William Overington

Saturday 9 February 2019

Re: Encoding colour (from Re: Encoding italic)

2019-02-09 Thread wjgo_10...@btinternet.com via Unicode


Previously I wrote:

A stateful method, though which might be useful for plain text streams 
in some applications, would be to encode as characters some of the 
glyphs for indicating colours and the digit characters to go with them 
from page 5 and from page 3 of the following publication.



http://www.users.globalnet.co.uk/~ngo/locse027.pdf


Thinking about this further, for this application copies of the glyphs 
could be redesigned so as to be square and could be emoji-style and the 
meanings of the characters specifying which colour component is to be 
set could be changed so that they refer to the number previously entered 
using one or more  of the special  digit characters. Thus the setting of 
colour components could be done in the same reverse notation way that 
the FORTH computer language works. Yet although the colour components 
thus set would be stateful until changed there would be no Escape 
sequence and if an application did not support interpretation of the 
characters as setting colours, they would just be displayed as glyphs, 
each either as a particular glyph or as a .notdef glyph.


William Overington
Saturday 9 February 2019

Re: Encoding italic

2019-02-09 Thread Richard Wordingham via Unicode

On Sat, 9 Feb 2019 04:52:30 -0800
David Starner via Unicode  wrote:

> Note that this is actually the only thing that stands out to me in
> Unicode not supporting older character sets; in PETSCII (Commodore
> 64), the high-bit characters were the reverse (in this
> sense) of the low-bit characters.

Later ISCII has some styling codes, bold and italic amongst them.

Richard.

Re: Encoding italic

2019-02-09 Thread Rebecca Bettencourt via Unicode

On Sat, Feb 9, 2019 at 4:58 AM David Starner via Unicode <
unicode@unicode.org> wrote:

>
> On Sat, Feb 9, 2019 at 3:59 AM Kent Karlsson via Unicode <
> unicode@unicode.org> wrote:
>
>>
>> On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:
>> > • Reverse on: ESC [7m
>> > • Reverse off: ESC [27m
>>
>> "Reverse" = "switch background and foreground colours".
>>
>> This is an (odd) colour thing. If you want to go with (full!) colour
>> (foreground and background), fine, but the "reverse" is oddball (and
>> based on what really old terminals were limited to when it comes to
>> colour).
>>
>
> Note that this is actually the only thing that stands out to me in Unicode
> not supporting older character sets; in PETSCII (Commodore 64), the
> high-bit characters were the reverse (in this sense) of the
> low-bit characters.
>

This is true: many legacy character sets encoded reverse-video characters
as wholly separate characters, and even allowed them in contexts widely
considered plain-text such as file names. This makes reverse-video possibly
the one text attribute best argued to be worthy of encoding in Unicode. But
I can already tell you it won't work, because we made such an argument in
an early version of L2/19-025, and even proposed using VS14, the very same
VS William Overington has since swiped from us for italics. That proposal
was shot down rather quickly. Bold, italics, etc. don't even stand a chance.

Re: Encoding italic

2019-02-09 Thread David Starner via Unicode

On Sat, Feb 9, 2019 at 3:59 AM Kent Karlsson via Unicode <
unicode@unicode.org> wrote:

>
> On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:
> > • Reverse on: ESC [7m
> > • Reverse off: ESC [27m
>
> "Reverse" = "switch background and foreground colours".
>
> This is an (odd) colour thing. If you want to go with (full!) colour
> (foreground and background), fine, but the "reverse" is oddball (and
> based on what really old terminals were limited to when it comes to
> colour).
>

Note that this is actually the only thing that stands out to me in Unicode
not supporting older character sets; in PETSCII (Commodore 64), the
high-bit characters were the reverse (in this sense) of the
low-bit characters.

Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode


On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:

> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:

Note that these do NOT nest (no stack...), just state changes for the
relevant PART of the "graphic" (i.e. style) state. So the approach in
that regard is quite different from the approach done in HTML/CSS.

>  Italics on: ESC [3m
>  Italics off: ESC [23m
>  Bold on: ESC [1m
>  Bold off: ESC [22m
>  Underline on: ESC [4m
(implies turning double underline off)

   Underline, double: ESC [21m
(implies turning single underline off)

>  Underline off: ESC [24m
>  Strikethrough on: ESC [9m
>  Strikethrough off: ESC [29m
>  Reverse on: ESC [7m
>  Reverse off: ESC [27m

"Reverse" = "switch background and foreground colours".

This is an (odd) colour thing. If you want to go with (full!) colour
(foreground and background), fine, but the "reverse" is oddball (and
based on what really old terminals were limited to when it comes to colour).

I'd rather include 'ESC [50m' (not variable spacing, i.e. "monospace" font)
and 'ESC [26m' (variable spacing, i.e. "proportional" font). Recall that
this is NOT for terminal emulators but for styling applied to text
outside of terminal emulators. (Terminal emulators already implement
much of this and more; albeit sometimes wrongly). This would be handy
for including (say) programming code or computer commands (or for that
matter, "ASCII art", or more generally "Unicode art") in otherwise
"ordinary" text... (The "ordinary" text preferably set in a proportional
font.)

>  Reset all attributes: ESC [m

(Actually 'ESC [0m', with the 0 default-able.) Handy, agreed, but not 100%
necessary.
These ESC-sequences should not normally be inserted "manually" but by a text
editor program, using the conventional means of "making bold" etc. (ctrl-b,
cmd-b, "bold" in a menu); only "hackers" (in the positive sense) would
actually bother about the command sequences as such.

/Kent K


> where ESC is U+001B.
>  
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>  
>

Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode



On 2019-02-08 22:29, "Egmont Koblinger via Unicode" wrote:

> (Mind you, I don't find it a good idea to add italic and whatnot
> formatting support to Unicode at all... but let's put aside that now.)

I don't think Doug meant to "add it to the Unicode standard", just to
have a summary of "handy esc-sequences (actually command-sequences)
for simple styling of text" picked from long-standing (text level...)
standards.

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.
> 
> There is not a well-defined framework for escape sequences. In this
> particular case you might say it starts with ESC [ and ends with the
> letter 'm', but how do you know where to end the sequence if that
> letter 'm' just doesn't arrive? Terminal emulators have extremely

There is an overriding "basic (overall) syntax" for esc-seq/
command-sequences that do not include a string argument (like OSC,
APC, ...). IIUC it is (originally as byte sequences, but here as
character sequences):

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

(no newline or carriage return in there). True, that has no direct
limit, but it would not be unreasonable to set a limit of (say)
max 30 characters. Potential (i.e. starting with ESC) esc-"sequences"
that do not match the overall syntax or are too long can simply be
rendered as is (except for the ESC itself). The esc/command sequences
(that match) but are not interpreted should be ignored in "normal"
(not "show invisibles" mode) display.
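As a rough illustration of the overall syntax above, here is a minimal Python sketch that recognises such sequences and drops them from a string. The names `SEQ`, `MAX_LEN` and `strip_sequences` are made up for this example, and the 30-character cap is simply the limit suggested above. One detail is an assumption of the sketch rather than part of the syntax: the CSI branch is listed first, because Python tries alternatives left to right and the plain escape-sequence branch would otherwise match "ESC [" on its own.

```python
import re

MAX_LEN = 30  # the cap suggested above; an assumption, not taken from a standard

# CSI branch first, then the plain escape-sequence branch (see lead-in).
SEQ = re.compile(
    r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    r"|\x1b[\x20-\x2f]*[\x30-\x7e]"
)

def strip_sequences(text: str) -> str:
    """Remove well-formed sequences; show over-long ones minus their first character."""
    def repl(match: re.Match) -> str:
        seq = match.group(0)
        if len(seq) > MAX_LEN:
            return seq[1:]  # rendered as is, except for the introducer
        return ""
    return SEQ.sub(repl, text)
```

An ESC that is not followed by anything matching the syntax is left untouched by this sketch; a real implementation would have to decide what to do with it.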

They are unlikely to be "default ignored" by such things as sorting
(and should preferably be filtered out beforehand, if possible). But
if we compare to other rich text editors, the command sequences should
be ignored by (interactive) searching, just like HTML tags are ignored
in interactive searching (the internal representation "skipping" the
HTML tags in one way or another). HTML tags should also (when the text is
known to be HTML) be filtered out before doing such things as sorting.

> complex tables for parsing (and still many of them get plenty of
> things wrong). It's unreasonable for any random small utility
> processing Unicode text to go into this business of recognizing all
> the well-known escape sequences, not even to the extent to know where
> they end. Whatever is designed should be much more easily parseable.
> Should you say "everything from ESC[ to m", you'll cause a whole bunch
> of problems when a different kind of escape sequence gets interpreted
> as Unicode.

The escape/command sequences would not be part of Unicode (standard).

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each. Also, it should be

Formally covered by the (non-Unicode) standards, but optional (IIUC).

> carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[
> opening for an escape sequence – here terminal emulators vary. These
> just make everything even more cumbersome.
> 
> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".

I think one should interpret these in a "modern" way, not looking
too much at what old terminals were limited to. (Colour ("increased
intensity") should be handled completely separately from bold.)

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette? Should Unicode go into the business of defining a fixed set
> of colors, or allow to alter the palette colors using the OSC 4 and
> friends escape sequences, which are supported by about half of the terminal
> emulators out there?

IF extending to colour, only refer to "true colour" (RGB) command-sequence.
The colour palette versions are for the limitations of (semi-)old terminals.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.

It can only be colon. Using semicolon would interfere with the syntax
for multiple style specifications in one command sequence. (I by mistake
wrote a semicolon there in an earlier post; sorry.)

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them? Where to draw the line what

(Note colon, not semicolon, as separator.) Possible, partially matching
the capabilities for underlining via CSS (solid, dotted, dashed, wavy,
double). Depends on how many styling options one wants to pick up.

> to add to Unicode and what not to? Will Unicode possibly be a

I don't think anyone wants to make this part of the Unicode standard.
(At the most a Unicode technical note...; from Unicode's point of view.)

[...] 
> What to do with things that Unicode might also want to have, but
> doesn't exist in terminal emulators due to their nature, such as
> switching

Re: Encoding italic

2019-02-09 Thread Richard Wordingham via Unicode

On Fri, 8 Feb 2019 18:08:34 -0800
Asmus Freytag via Unicode  wrote:

> On 2/8/2019 5:42 PM, James Kass via Unicode wrote:

> You are still making the assumption that selecting a different glyph
> for the base character would automatically lead to the selection of a
> different glyph for the combining mark that follows. That's an iffy
> assumption because "italics" can be realized by choosing a separate
> font (typographically, italics is realized as a separate typeface).

The usual practice is to look for a font that supports both base
character and mark.

> Under the implicit assumptions bandied about here, the VS approach
> thus reveals itself as a true rich-text solution (font switching)
> albeit realized with pseudo coding rather than markup, markdown or
> escape sequences.

Isn't that already the case if one uses variation sequences to choose
between Chinese and Japanese glyphs?

>> Of course, the user might insert VS14s without application
>> assistance.  In which case hopefully the user knows the rules.  The
>> worst case scenario is where the user might insert a VS14 after a
>> non-base character, in which case it should simply be ignored by any
>> application.  It should never “break” the display or the processing;
>> it simply makes the text for that document non-conformant.  (Of
>> course putting a VS14 after “ê” should not result in an italicized
>> “ê”.)

Is there any obligation on applications to ignore it?  In plain text,
the Unicode rules allow the application to choose to render every third
'ê' as italic.  Possibly it comes down to the mens rea of the
application (or of its coder or specifier), but without mentalism an
application could opt to treat <ê, VS14> as .

A relevant concern would be 'voracious' with the first 'o'
italicised by VS14.  How would current typeface selection logic work?
I can envisage  only being in the cmap of an italic font.

Richard.

Re: Encoding italic

2019-02-08 Thread James Kass via Unicode




Asmus Freytag wrote,

> You are still making the assumption that selecting a different glyph for
> the base character would automatically lead to the selection of a
> different glyph for the combining mark that follows. That's an iffy
> assumption because "italics" can be realized by choosing a separate font
> (typographically, italics is realized as a separate typeface).
>
> There's no such assumption built into the definition of a VS. At best,
> inside the same font, there may be an implied ligature, but that does not
> work if there's an underlying font switch.

Midstream font switching isn’t a user option in most plain-text 
applications, although there can be some font substitution happening at 
the OS level.  Any combining mark must apply to its base letter glyph, 
even after a base letter glyph has been modified.


More sophisticated editors, like BabelPad, allow users to select 
different fonts for different ranges of Unicode.  If a user selects font 
X for ASCII and font Y for combining marks, then mark positioning is 
already broken.


If the user selects Times New Roman for both ASCII and combining marks, 
then no font switching is involved.  The Times New Roman type face 
includes italic letter form variants.  Any application sharp enough to 
know that the italic letter form variants are stored in a different 
computer *file* should be clever enough to apply mark positioning 
accordingly.  And any single font file which includes italic letters and 
maps them with VS14 would avoid any such issues altogether.

Re: Encoding italic

2019-02-08 Thread Asmus Freytag via Unicode


  
  
On 2/8/2019 5:42 PM, James Kass via Unicode wrote:

> William,
>
> Rather than having the user insert the VS14 after every character,
> the editor might allow the user to select a span of text for
> italicization.  Then it would be up to the editor/app to insert
> the VS14s where appropriate.
>
> For Andrew’s example of “fête”, the user would either type the
> string:
>
> “f” + “ê” + “t” + “e”
>
> or the string:
>
> “f” + “e” + <combining circumflex> + “t” + “e”.
>
> If the latter, the application would insert VS14 characters after
> the “f”, “e”, “t”, and “e”.  The application would not insert a
> VS14 after the combining circumflex — because the specification
> does not allow VS characters after combining marks, they may only
> be used on base characters.
>
> In the first ‘spelling’, since the specifications forbid VS
> characters after any character which is not a base character (in
> other words, not after any character which has a decomposition,
> such as “ê”) — the application would first need to convert the
> string to the second ‘spelling’, and proceed as above.  This is
> known as converting to NFD.
>
> So in order for VS14 to be a viable approach, any application
> would ① need to convert any selected span to NFD, and ② only
> insert VS14 after each base character.  And those are two
> operations which are quite possible, although they do add slightly
> to the programmer’s burden.  I don’t think it’s a “deal-killer”.

You are still making the assumption that selecting a different
glyph for the base character would automatically lead to the
selection of a different glyph for the combining mark that
follows. That's an iffy assumption because "italics" can be
realized by choosing a separate font (typographically, italics is
realized as a separate typeface).

There's no such assumption built into the definition of a VS. At
best, inside the same font, there may be an implied ligature, but
that does not work if there's an underlying font switch.

Under the implicit assumptions bandied about here, the VS
approach thus reveals itself as a true rich-text solution (font
switching) albeit realized with pseudo coding rather than markup,
markdown or escape sequences.

It's definitely no more "plain text" than HTML source code.

A./

> Of course, the user might insert VS14s without application
> assistance.  In which case hopefully the user knows the rules.
> The worst case scenario is where the user might insert a VS14
> after a non-base character, in which case it should simply be
> ignored by any application.  It should never “break” the display
> or the processing; it simply makes the text for that document
> non-conformant.  (Of course putting a VS14 after “ê” should not
> result in an italicized “ê”.)
>
> Cheers,
>
> James

Re: Encoding italic

2019-02-08 Thread James Kass via Unicode




William,

Rather than having the user insert the VS14 after every character, the 
editor might allow the user to select a span of text for italicization.  
Then it would be up to the editor/app to insert the VS14s where appropriate.


For Andrew’s example of “fête”, the user would either type the string:
“f” + “ê” + “t” + “e”
or the string:
“f” + “e” + <combining circumflex> + “t” + “e”.

If the latter, the application would insert VS14 characters after the 
“f”, “e”, “t”, and “e”.  The application would not insert a VS14 after 
the combining circumflex — because the specification does not allow VS 
characters after combining marks, they may only be used on base characters.


In the first ‘spelling’, since the specifications forbid VS characters 
after any character which is not a base character (in other words, not 
after any character which has a decomposition, such as “ê”) — the 
application would first need to convert the string to the second 
‘spelling’, and proceed as above.  This is known as converting to NFD.


So in order for VS14 to be a viable approach, any application would ① 
need to convert any selected span to NFD, and ② only insert VS14 after 
each base character.  And those are two operations which are quite 
possible, although they do add slightly to the programmer’s burden.  I 
don’t think it’s a “deal-killer”.
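The two steps can be sketched in a few lines of Python. This is only an illustration of the idea under discussion: Unicode defines no italic variation sequences, so the output is not conformant text, and the function name `italicize_vs14` is invented here. Treating every character that is not a mark, separator or control/format character as a "base character" is a simplifying assumption of the sketch.

```python
import unicodedata

VS14 = "\uFE0D"  # VARIATION SELECTOR-14

def italicize_vs14(text: str) -> str:
    """Step 1: convert to NFD. Step 2: append VS14 after each base character."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.append(ch)
        # Assumption: marks (M*), separators (Z*) and control/format
        # characters (C*) are not bases and get no selector.
        if unicodedata.category(ch)[0] not in "MZC":
            out.append(VS14)
    return "".join(out)
```

Both spellings of “fête” produce the same result, with the combining circumflex following the VS14 that was attached to the “e”.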


Of course, the user might insert VS14s without application assistance.  
In which case hopefully the user knows the rules.  The worst case 
scenario is where the user might insert a VS14 after a non-base 
character, in which case it should simply be ignored by any 
application.  It should never “break” the display or the processing; it 
simply makes the text for that document non-conformant.  (Of course 
putting a VS14 after “ê” should not result in an italicized “ê”.)


Cheers,

James

Re: Encoding italic

2019-02-08 Thread Richard Wordingham via Unicode

On Fri, 8 Feb 2019 14:26:28 -0800
Asmus Freytag via Unicode  wrote:

> On 2/8/2019 2:08 PM, Richard Wordingham via Unicode wrote:
> On Fri, 8 Feb 2019 17:16:09 + (GMT)
> "wjgo_10...@btinternet.com via Unicode"  wrote:
> 
> Andrew West wrote:
> 
> Just reminding you that "The initial character in a variation
> sequence  
> is never a nonspacing combining mark (gc=Mn) or a canonical
> decomposable character" (The Unicode Standard 11.0 §23.4).
> 
> Hopefully the issue that Andrew mentions can be resolved in some way.
> 
> This is not a problem.  Instead of writing <ê, VS14>, one just writes
> .
> 
> And  introducing yet another convention, which is that combining
> marks inherit the font of the base character.
> 
> Remember, italics, even though presented as a boolean attribute in
> most UIs is in fact typographically a font selection.

Wouldn't  be the base character for the selection of the
font?

Richard.

Re: Encoding italic

2019-02-08 Thread Asmus Freytag via Unicode


  
  
On 2/8/2019 2:08 PM, Richard Wordingham via Unicode wrote:

> On Fri, 8 Feb 2019 17:16:09 + (GMT)
> "wjgo_10...@btinternet.com via Unicode"  wrote:
>
>> Andrew West wrote:
>>
>>> Just reminding you that "The initial character in a variation
>>> sequence is never a nonspacing combining mark (gc=Mn) or a canonical
>>> decomposable character" (The Unicode Standard 11.0 §23.4).
>>
>> Hopefully the issue that Andrew mentions can be resolved in some way.
>
> This is not a problem.  Instead of writing <ê, VS14>, one just writes
> .

And  introducing yet another convention, which is that
combining marks inherit the font of the base character.

Remember, italics, even though presented as a boolean attribute
in most UIs is in fact typographically a font selection.

A./

> Richard.

Re: Encoding italic

2019-02-08 Thread Richard Wordingham via Unicode

On Fri, 8 Feb 2019 22:29:57 +0100
Egmont Koblinger via Unicode  wrote:

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them? Where to draw the line what
> to add to Unicode and what not to? Will Unicode possibly be a
> bottleneck of further improvements in terminal emulators, because from
> now on every new mode we figure out we'd like to have in terminals
> should go through some Unicode committee? And what if Unicode wants to
> have a mode that terminal emulators aren't interested in, who will
> assign numbers to them that don't clash with terminals? Who will
> somehow keep the two worlds in sync?

Escape sequences are outside the scope of Unicode.  They are part of a
higher level protocol (TUS 23.1 'Control codes').

Richard.

Re: Encoding italic

2019-02-08 Thread Richard Wordingham via Unicode

On Fri, 8 Feb 2019 17:16:09 + (GMT)
"wjgo_10...@btinternet.com via Unicode"  wrote:

> Andrew West wrote:

>> Just reminding you that "The initial character in a variation
>> sequence  
>> is never a nonspacing combining mark (gc=Mn) or a canonical
>> decomposable character" (The Unicode Standard 11.0 §23.4).

> Hopefully the issue that Andrew mentions can be resolved in some way.

This is not a problem.  Instead of writing <ê, VS14>, one just writes
.

Richard.

Re: Encoding italic

2019-02-08 Thread Egmont Koblinger via Unicode

Hi guys,

Having been a terminal emulator developer for some years now, I have
to say – perhaps surprisingly – that I don't fancy the idea of reusing
escape sequences of the terminal world.

(Mind you, I don't find it a good idea to add italic and whatnot
formatting support to Unicode at all... but let's put aside that now.)

There are a lot of problems with these escape sequences, and if you go
for a potentially new standard, you might not want to carry these
problems.

There is not a well-defined framework for escape sequences. In this
particular case you might say it starts with ESC [ and ends with the
letter 'm', but how do you know where to end the sequence if that
letter 'm' just doesn't arrive? Terminal emulators have extremely
complex tables for parsing (and still many of them get plenty of
things wrong). It's unreasonable for any random small utility
processing Unicode text to go into this business of recognizing all
the well-known escape sequences, not even to the extent to know where
they end. Whatever is designed should be much more easily parseable.
Should you say "everything from ESC[ to m", you'll cause a whole bunch
of problems when a different kind of escape sequence gets interpreted
as Unicode.

A parser, by the way, would also have to interpret combined sequences
like ESC[3;0;1m or alike, for which I don't see a good reason as
opposed to having separate sequences for each. Also, it should be
carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[
opening for an escape sequence – here terminal emulators vary. These
just make everything even more cumbersome.
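To make the parsing concern concrete, here is a small Python sketch of what interpreting a combined parameter string such as "3;0;1" involves. It covers only the attribute codes listed in this thread; the names `ON`, `OFF` and `apply_sgr` are hypothetical, and skipping colon sub-parameters such as "4:3" is a simplification of the sketch.

```python
# Codes as listed in this thread; attribute names are made up for the example.
ON = {1: "bold", 3: "italic", 4: "underline", 7: "reverse",
      9: "strikethrough", 21: "double-underline"}
OFF = {22: {"bold"}, 23: {"italic"}, 24: {"underline", "double-underline"},
       27: {"reverse"}, 29: {"strikethrough"}}

def apply_sgr(state: frozenset, params: str) -> frozenset:
    """Apply the parameters of one 'ESC [ ... m' sequence to the current state."""
    attrs = set(state)
    for p in params.split(";"):
        if p and not p.isdigit():
            continue  # colon sub-parameters etc. are ignored in this sketch
        code = int(p) if p else 0  # an empty parameter defaults to 0
        if code == 0:
            attrs.clear()  # reset all attributes
        elif code in ON:
            attrs.add(ON[code])
            # single and double underline replace each other
            if code == 4:
                attrs.discard("double-underline")
            elif code == 21:
                attrs.discard("underline")
        elif code in OFF:
            attrs -= OFF[code]
    return frozenset(attrs)
```

With this reading, "3;0;1" turns italics on, resets everything, then turns bold on, so the resulting state is bold only. Each parameter changes one part of the state independently; nothing is pushed or popped.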

ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".
It's only nowadays that most terminal emulators support 256 colors and
some even support 16M true colors that some emulators try to push for
this bit unambiguously meaning "bold" only, whereas in most emulators
it means "both bold and increased intensity". For compatibility
reasons, it won't be a smooth switch. Note that "bold" and "increased
intensity" only go in the same direction with a white-on-black color
scheme; with black-on-white, bold stands out more while increased
intensity (a lighter shade of gray instead of black) stands out less.
(We could also start nitpicking that the spec doesn't even say that
increased intensity is just for the foreground and not for the
background too.)

Should this scheme be extended for colors, too? What to do with the
legacy 8/16 as well as the 256-color extensions wrt. the color
palette? Should Unicode go into the business of defining a fixed set
of colors, or allow to alter the palette colors using the OSC 4 and
friends escape sequences, which are supported by about half of the terminal
emulators out there?

For 256-colors and truecolors, there are two or three syntaxes out
there regarding whether the separator is a colon or a semicolon.
ECMA-48 doesn't say anything about it; ITU T.416 does, although it's
absolutely not clear. See e.g. the discussion in the comment section
of https://gist.github.com/XVilka/8346728 : in Dec 2018, we just
couldn't figure out which syntax exactly ITU T.416 intends.
Moreover, due to a common misinterpretation of the spec, one of the
positional parameters is often omitted.
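For readers who have not seen the variants side by side, the following Python sketch builds a foreground true-colour sequence in three forms. These are the forms commonly seen in terminal emulators as far as I understand them; which one T.416 actually intends is exactly the open question described above, so nothing here should be read as authoritative. The function name `truecolor_fg` and the `syntax` labels are invented for the example.

```python
ESC = "\x1b"

def truecolor_fg(r: int, g: int, b: int, syntax: str = "colon") -> str:
    """Build an SGR foreground-colour sequence in one of three seen-in-the-wild forms."""
    if syntax == "semicolon":
        body = f"38;2;{r};{g};{b}"      # semicolon-separated form
    elif syntax == "colon-short":
        body = f"38:2:{r}:{g}:{b}"      # colon form with one positional slot left out
    elif syntax == "colon":
        body = f"38:2::{r}:{g}:{b}"     # colon form with an empty slot before the colour values
    else:
        raise ValueError(f"unknown syntax: {syntax}")
    return f"{ESC}[{body}m"
```

A consumer that wants to be tolerant has to accept all three, which illustrates the parsing burden being discussed.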

Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
for curly underline. What to do with them? Where to draw the line what
to add to Unicode and what not to? Will Unicode possibly be a
bottleneck of further improvements in terminal emulators, because from
now on every new mode we figure out we'd like to have in terminals
should go through some Unicode committee? And what if Unicode wants to
have a mode that terminal emulators aren't interested in, who will
assign numbers to them that don't clash with terminals? Who will
somehow keep the two worlds in sync?

What to do with things that Unicode might also want to have, but
doesn't exist in terminal emulators due to their nature, such as
switching to a different font size?

> This mechanism [...] is already supported
> as widely as any new Unicode-only convention will ever be.

I truly doubt this, these escape sequences are specific to terminal
emulation, an extremely narrow subset of where Unicode is used and
rich text might be desired.

I see it a much more viable approach if Unicode goes for something
brand new, something clean, easily parseable, and it remains the job
of specific applications to serve as a bridge between the two worlds.
Or, if it wants to adopt some already existing technology, I find
HTML/CSS a much better starting point.

regards,
egmont

On Fri, Feb 8, 2019 at 9:55 PM Doug Ewell via Unicode
 wrote:
>
> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:
>
> •   Italics on: ESC [3m
> •   Italics off: ESC [23m
> •   Bold on: ESC [1m
> •

Re: Encoding italic

2019-02-08 Thread Rebecca Bettencourt via Unicode

+∞

-- Rebecca Bettencourt


On Fri, Feb 8, 2019 at 12:55 PM Doug Ewell via Unicode 
wrote:

> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:
>
> •   Italics on: ESC [3m
> •   Italics off: ESC [23m
> •   Bold on: ESC [1m
> •   Bold off: ESC [22m
> •   Underline on: ESC [4m
> •   Underline off: ESC [24m
> •   Strikethrough on: ESC [9m
> •   Strikethrough off: ESC [29m
> •   Reverse on: ESC [7m
> •   Reverse off: ESC [27m
> •   Reset all attributes: ESC [m
>
> where ESC is U+001B.
>
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>

Re: Encoding italic

2019-02-08 Thread Doug Ewell via Unicode

I'd like to propose encoding italics and similar display attributes in
plain text using the following stateful mechanism:
 
•   Italics on: ESC [3m
•   Italics off: ESC [23m
•   Bold on: ESC [1m
•   Bold off: ESC [22m
•   Underline on: ESC [4m
•   Underline off: ESC [24m
•   Strikethrough on: ESC [9m
•   Strikethrough off: ESC [29m
•   Reverse on: ESC [7m
•   Reverse off: ESC [27m
•   Reset all attributes: ESC [m
 
where ESC is U+001B.
 
This mechanism has existed for around 40 years and is already supported
as widely as any new Unicode-only convention will ever be.
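A minimal Python sketch of what emitting these sequences might look like, using only the codes in the list above. The table name `SGR` and the helper `styled` are invented for illustration.

```python
ESC = "\u001b"

# (on, off) parameter pairs, taken from the list above
SGR = {
    "italic": ("3", "23"),
    "bold": ("1", "22"),
    "underline": ("4", "24"),
    "strikethrough": ("9", "29"),
    "reverse": ("7", "27"),
}
RESET = ESC + "[m"  # reset all attributes

def styled(text: str, attribute: str) -> str:
    """Wrap text in the 'on' and 'off' sequences for one attribute."""
    on, off = SGR[attribute]
    return f"{ESC}[{on}m{text}{ESC}[{off}m"
```

For example, `print(styled("fête", "italic"))` should show the word in italics in a terminal emulator that supports the attribute; an application that does not interpret the sequences will see the raw control characters instead.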
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-02-08 Thread wjgo_10...@btinternet.com via Unicode


Andrew West wrote:


Just reminding you that "The initial character in a variation sequence
is never a nonspacing combining mark (gc=Mn) or a canonical
decomposable character" (The Unicode Standard 11.0 §23.4). This means
that a variation sequence cannot be defined for any precomposed
letters and diacritics, so for example you could not italicize the
word "fête" by simply adding VS14 after each letter because "ê" (in
NFC form) cannot act as the base for a variation sequence. You would
have to first convert any text to be italicized to NFD, then apply
VS14 to each non-combining character. This alone would make a VS
solution unacceptable in my opinion.

As it happens I was not aware of that before, and in fact I had already 
produced a PDF document for submission to the Unicode Technical 
Committee when I read your post.


https://www.unicode.org/L2/L2019/19063-italic-vs.pdf

So, it is an issue that needs to be resolved.

I am a researcher and I am looking for the best way to do this so as to 
get a good result that people can use; I am not trying to assert that my 
suggestion is necessarily the best way to do it. For example, I accepted 
the suggestion that James made.  The meeting of the Unicode Technical 
Committee is not due until April and hopefully some other people will 
send in documents and comments on the topic.


Hopefully the issue that Andrew mentions can be resolved in some way.

William Overington
Friday 8 February 2019

Re: Encoding italic

2019-02-05 Thread Richard Wordingham via Unicode

On Tue, 5 Feb 2019 16:01:41 +
Andrew West via Unicode  wrote:

> You would
> have to first convert any text to be italicized to NFD, then apply
> VS14 to each non-combining character. This alone would make a VS
> solution unacceptable in my opinion.

What is so unacceptable about having to do this?

Richard.

Re: Encoding italic

2019-02-05 Thread Andrew West via Unicode

On Tue, 5 Feb 2019 at 15:34, wjgo_10...@btinternet.com via Unicode
 wrote:
>
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

Just reminding you that "The initial character in a variation sequence
is never a nonspacing combining mark (gc=Mn) or a canonical
decomposable character" (The Unicode Standard 11.0 §23.4). This means
that a variation sequence cannot be defined for any precomposed
letters and diacritics, so for example you could not italicize the
word "fête" by simply adding VS14 after each letter because "ê" (in
NFC form) cannot act as the base for a variation sequence. You would
have to first convert any text to be italicized to NFD, then apply
VS14 to each non-combining character. This alone would make a VS
solution unacceptable in my opinion.

Andrew

Re: Encoding italic

2019-02-05 Thread wjgo_10...@btinternet.com via Unicode


James Kass wrote:

William’s suggestion of floating a proposal for handling italics with 
VS14 might be an example of the old saying about “putting the cart 
before the horse”.


Well, a proposal just about using VS14 to indicate a request for an 
italic version of a glyph in plain text, including a suggestion of to 
which characters it could apply, would test whether such a proposal 
would be accepted to go into the Document Register for the Unicode 
Technical Committee to consider or just be deemed out of scope and 
rejected and not considered by the Unicode Technical Committee.


If the proposal were allowed to become included in the Document Register 
of the Unicode Technical Committee then if other people wish to submit 
comments and other proposals then that would be possible as it would 
have become established that such a topic is deemed acceptable for 
placing into the Document Register of the Unicode Technical Committee.


William Overington
Tuesday 5 February 2019

Re: Encoding italic

2019-02-05 Thread James Kass via Unicode




William Overington wrote,

> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

As long as “italics in plain-text” is considered out-of-scope by 
Unicode, any proposal for handling italics in plain-text would probably 
be considered out-of-scope, as well.  But I could be wrong and wouldn’t 
mind seeing a proposal.

Re: Encoding italic

2019-02-04 Thread James Kass via Unicode




Philippe Verdy responded to William Overington,

> the proposal would contradict the goals of variation selectors and would
> pollute the variation sequences registry (possibly even creating
> conflicts).
> And if we admit it for italics, then another VSn will be dedicated to
> bold, and another for monospace, and finally many would follow for
> various style modifiers.
> Finally we would no longer have enough variation selectors for all
> requests.


There are 256 variation selector characters.  Any use of variation 
sequences not registered by Unicode would be non-conformant.
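
As a quick check of the figure of 256, the general-purpose selectors VS1 to VS256 occupy two ranges, U+FE00..U+FE0F and U+E0100..U+E01EF. The sketch below simply enumerates those two ranges (the Mongolian free variation selectors are deliberately left out); the names VS_RANGES and variation_selectors are made up for this example.

```python
import unicodedata

# VS1..VS16 live in the BMP, VS17..VS256 in Plane 14.
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def variation_selectors() -> list[str]:
    """Return VS1..VS256 in order."""
    return [chr(cp) for lo, hi in VS_RANGES for cp in range(lo, hi + 1)]


selectors = variation_selectors()
names = [unicodedata.name(c) for c in selectors]

print(len(selectors))
print(names[13], f"U+{ord(selectors[13]):04X}")
```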


William’s suggestion of floating a proposal for handling italics with 
VS14 might be an example of the old saying about “putting the cart 
before the horse”.  Any preliminary proposal would first have to clear 
the hurdle of the propriety of handling italic information at the 
plain-text level.  Such a proposal might list various approaches for 
accomplishing that, if that hurdle can be surmounted.

Re: Encoding italic

2019-02-01 Thread Philippe Verdy via Unicode

The proposal would contradict the goals of variation selectors and would
pollute the variation sequences registry (possibly even creating
conflicts). And if we admit it for italics, then another VSn will be
dedicated to bold, and another for monospace, and finally many would follow
for various style modifiers.
Finally we would no longer have enough variation selectors for all
requests.
And what we would have made would only be an attempt to reproduce another
existing styling standard, but very inefficiently. This use will be "abused"
for all usages, creating new implementation constraints and goals that
contradict existing styling languages, which would then decide to make these
characters incompatible for use in conforming applications. The Unicode
encoding would have lost all its interest.
I do not support the idea of encoding generic styles (applicable to more
than 100k existing characters) using variation selectors. Their goal is
only to allow semantic distinctions when two glyphs that were unified may
occasionally (not always) have some significance in specific
languages. But what you propose would apply to all languages, all scripts,
and would definitely reserve some of the few existing VSn for this styling
use, blocking further registration of needed distinctions (VSn characters
are notably needed for sinographic scripts to properly represent toponyms
or person names, or to solve some problems existing with generic character
properties in Unicode that cannot be changed because of stability rules).


On Thu, 31 Jan 2019 at 16:32, wjgo_10...@btinternet.com via Unicode
<unicode@unicode.org> wrote:

> Is the way to try to resolve this for a proposal document to be produced
> for using Variation Selector 14 in order to produce italics and for the
> proposal document to be submitted to the Unicode Technical Committee?
>
> If the proposal is allowed to go to the committee rather than being
> ruled out of scope, then we can know whether the Unicode Technical
> Committee will allow the encoding.
>
> William Overington
>
> Thursday 31 January 2019
>
>

Re: Encoding italic

2019-02-01 Thread James Kass via Unicode




On 2019-01-31 3:18 PM, Adam Borowski via Unicode wrote:

> They're only from a spammer's point of view.

Spammers need love, too.  They’re just not entitled to any.

Re: Encoding italic

2019-01-31 Thread Asmus Freytag via Unicode


  
  
On 1/31/2019 12:55 AM, Tex via Unicode wrote:

> As with the many problems with walls not being effective, you choose to
> ignore the legitimate issues pointed out on the list with the lack of
> italic standardization for Chinese braille, text to voice readers, etc.
> The choice of plain text isn't always voluntary. And the existing
> alternatives, like math italic characters, are problematic.

The underlying issue is the lack of rich text support in places where
users expect rich text.

The solution is to find ways to enable rich text layers that are not full
documents and make them interoperable.

The solution is not to push this into plain text - which then becomes
lowest common denominator rich text instead.

A./

RE: Encoding italic

2019-01-31 Thread Doug Ewell via Unicode

Kent Karlsson wrote:

> ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
> sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
> is implemented in Cygwin (sorry for mentioning a product name).)

Fair enough. This thread is mostly about italics and bold and such, not
colors, but the point is well taken that one of these leads invariably
to the others, especially if the standard or flavor in question
implements them.

> ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
> traditionally does not use bold or italic.

But that's OK. For low-level mechanisms like these, it should be
incumbent on the user to say, "Yes, I can use this styling with that
script, but I shouldn't; it would look terrible and would fly in the
face of convention." ISO 6429 also allows green text on a cyan
background, which is about as good an idea as CJK italics.

> Compare those specified for CSS
> (https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
> https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
> These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
> be of interest for the generalised subject of this thread.

I'm hoping we can continue to restrict this thread to plain text.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-31 Thread Adam Borowski via Unicode

On Thu, Jan 31, 2019 at 02:21:40PM +, James Kass via Unicode wrote:
> David Starner wrote,
> > The choice of using single-byte character sets isn't always voluntary.
> > That's why we should use ISO-2022, not Unicode. Or we can expect
> > people to fix their systems. What systems are we talking about, that
> > support Unicode but compel you to use plain text? The use of Twitter
> > is surely voluntary.
> 
> This marketing-related web page,
> 
> https://litmus.com/blog/best-practices-for-plain-text-emails-a-look-at-why-theyre-important
> 
> ...lists various reasons for using plain-text e-mail.

They're only from a spammer's point of view.

> Besides marketing, there’s also newsletters and e-mail discussion groups. 
> Some of those discussion groups are probably scholarly. Anyone involved in
> that would likely embrace ‘super cool Unicode text magic’ and it’s
> surprising if none of them have stumbled across the math alphanumerics yet.

Then there are technical mailing lists.  In particular, on every single list
other than Unicode I'm subscribed to, an HTML-only mail would get you flamed
by several list members; even a plain+HTML alternative can get you an
earful.

Then there's LKML and other lists hosted at vger, where a mail that so much
as has an HTML version attached will get outright rejected at the mail
software level.

After 2½ decades of participating in mailing lists, I got an aversion
to HTML mails burned in as a kind of involuntary reflex.  Upon seeing Asmus'
mails, the ingrained reflex kicks in, I start getting upset, only to realize
what list I'm reading and that it's him who's a regular here, not me.

So even when in principle adding such features would be possible, many
communities decide to prefer interoperability over newest types of bling.
Some prefer top-posted HTML mails, some prefer Twitter, some Unicode plain
text, some perhaps want plain ASCII only.

> It’s true that people don’t have to use Twitter.  People don’t have to turn
> on their computers, either.

And sometimes they use a Braille reader or a text console.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄

Re: Encoding italic

2019-01-31 Thread wjgo_10...@btinternet.com via Unicode

Is the way to try to resolve this for a proposal document to be produced 
for using Variation Selector 14 in order to produce italics and for the 
proposal document to be submitted to the Unicode Technical Committee?


If the proposal is allowed to go to the committee rather than being 
ruled out of scope, then we can know whether the Unicode Technical 
Committee will allow the encoding.


William Overington

Thursday 31 January 2019

Re: Encoding italic

2019-01-31 Thread James Kass via Unicode




David Starner wrote,

> The choice of using single-byte character sets isn't always voluntary.
> That's why we should use ISO-2022, not Unicode. Or we can expect
> people to fix their systems. What systems are we talking about, that
> support Unicode but compel you to use plain text? The use of Twitter
> is surely voluntary.

This marketing-related web page,

https://litmus.com/blog/best-practices-for-plain-text-emails-a-look-at-why-theyre-important

...lists various reasons for using plain-text e-mail.  Here’s an excerpt.

“Some people simply prefer it. Plain and simple—some people prefer text 
emails. ... Some users may also see HTML emails as a security and 
privacy risk, and choose not to load any images and have visibility over 
all links that are included in an email. In addition, the increased 
bandwidth that image-heavy emails tend to consume is another driver of 
why users simply prefer plain-text emails.”


Besides marketing, there’s also newsletters and e-mail discussion 
groups.  Some of those discussion groups are probably scholarly. Anyone 
involved in that would likely embrace ‘super cool Unicode text magic’ 
and it’s surprising if none of them have stumbled across the math 
alphanumerics yet.


A web search for the string “plain text only” leads to all manner of 
applications for which searchers are trying to control their 
environments.  There’s all kinds of reasons why some people prefer to 
use plain-text, it’s often an informed choice and it isn’t limited to 
e-mail.


It’s true that people don’t have to use Twitter.  People don’t have to 
turn on their computers, either.

Re: Encoding italic

2019-01-31 Thread James Kass via Unicode




David Starner wrote,

> Emoji, as has been pointed out several times, were in the original
> Unicode standard and date back to the 1980s; the first DOS code
> page has smileys at 0x01 and 0x02.

That's disingenuous.

Re: Encoding italic

2019-01-31 Thread David Starner via Unicode

On Thu, Jan 31, 2019 at 12:56 AM Tex  wrote:
>
> David,
>
> "italics has never been considered part of plain text and has always been 
> considered outside of plain text. "
>
> Time to change the definition if that is what is holding you back.

That's not a definition; that's a fact. Again, it's like the 8-bit
byte; there are systems with other sizes of byte, but you usually
shouldn't worry about it. Building systems that don't have 8-bit bytes
is possible, but it's likely to cost more than it's worth.

> As has been said before, interlinear annotation, emoji and other features of 
> Unicode which  are now considered plain text were not in the original 
> definition.

https://www.w3.org/TR/unicode-xml/#Interlinear (which used to be
Unicode Technical Report #20) says "The interlinear annotation
characters were included in Unicode only in order to reserve code
points for very frequent application-internal use. ... Including
interlinear annotation characters in marked-up text does not work
because the additional formatting information (how to position the
annotation,...) is not available. ... The interlinear annotation
characters are also problematic when used in plain text, and are not
intended for that purpose."

Emoji, as has been pointed out several times, were in the original
Unicode standard and date back to the 1980s; the first DOS code
page has smileys at 0x01 and 0x02.

> If Unicode encoded an italic mechanism it would be part of plain text, just 
> as the many other styled spaces, dashes and other characters have become 
> plain text despite being typographic.

If Unicode encoded an italic mechanism, then some "plain text" would
include italics. Maybe it would be successful, and maybe it would join
the interlinear annotation characters as another discouraged poorly
supported feature.

> As with the many problems with walls not being effective, you choose to 
> ignore the legitimate issues pointed out on the list with the lack of italic 
> standardization for Chinese braille, text to voice readers, etc.

Text to voice readers don't have problems with the lack of italic
standardization; they have problems with people using mathematical
characters instead of actual letters.
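
One way a text-to-speech front end might cope with such input is to fold the mathematical letters back to ordinary ones before speaking. The sketch below uses NFKC normalization for that; it is only an illustration (the function name fold_styled_letters is made up, and no claim is made that any particular screen reader works this way). NFKC also folds many other compatibility characters, which may or may not be wanted.

```python
import unicodedata


def fold_styled_letters(text: str) -> str:
    """Fold compatibility characters (including math alphanumerics) to plain letters."""
    return unicodedata.normalize("NFKC", text)


# The word "italic" written with MATHEMATICAL ITALIC SMALL letters.
styled = "\U0001D456\U0001D461\U0001D44E\U0001D459\U0001D456\U0001D450"

print(fold_styled_letters(styled))
```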

> The choice of plain text isn't always voluntary.

The choice of using single-byte character sets isn't always voluntary.
That's why we should use ISO-2022, not Unicode. Or we can expect
people to fix their systems. What systems are we talking about, that
support Unicode but compel you to use plain text? The use of Twitter
is surely voluntary.

-- 
Kie ekzistas vivo, ekzistas espero.

RE: Encoding italic

2019-01-31 Thread Tex via Unicode

David,

"italics has never been considered part of plain text and has always been 
considered outside of plain text. "

Time to change the definition if that is what is holding you back. As has been 
said before, interlinear annotation, emoji and other features of Unicode which  
are now considered plain text were not in the original definition. If Unicode 
encoded an italic mechanism it would be part of plain text, just as the many 
other styled spaces, dashes and other characters have become plain text despite 
being typographic.

"The fact that italics can be handled elsewhere very much weighs against the 
value of your change. Everything you want to do can be done and is being done, 
except when someone chooses not to do it."

I heard a recent similar argument that goes: walls have been around since 
medieval times and they work really well... (Except they provably don't.)

As with the many problems with walls not being effective, you choose to ignore 
the legitimate issues pointed out on the list with the lack of italic 
standardization for Chinese braille, text to voice readers, etc.
The choice of plain text isn't always voluntary. And the existing alternatives, 
like math italic characters, are problematic.

tex

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner 
via Unicode
Sent: Wednesday, January 30, 2019 11:59 PM
To: Unicode Mailing List
Subject: Re: Encoding italic

On Wed, Jan 30, 2019 at 11:37 PM James Kass via Unicode
 wrote:
> As Tex Texin observed, differences of opinion as to where we draw the
> line between text and mark-up are somewhat ideological.  If a compelling
> case for handling italics at the plain-text level can be made, then the
> fact that italics can already be handled elsewhere doesn’t matter.  If a
> compelling case cannot be made, there are always alternatives.

To the extent I'd have ideology here, it's that that line is arbitrary
and needs to fit practical demands. Should we have eight-bit bytes?
I'm not sure that was the best solution, and other systems worked just
fine, but we've got a computing environment that makes anything else
unpractical. Unlike that question, italics has never been considered
part of plain text and has always been considered outside of plain
text. The fact that italics can be handled elsewhere very much weighs
against the value of your change. Everything you want to do can be
done and is being done, except when someone chooses not to do it.

-- 
Kie ekzistas vivo, ekzistas espero.

Re: Encoding italic

2019-01-31 Thread Andrew Cunningham via Unicode

On Thursday, 31 January 2019, James Kass via Unicode wrote:
>
>
> As for use of other variant letter forms enabled by the math
> alphanumerics, the situation exists.  It’s an interesting phenomenon which
> is sometimes worthy of comment and relates to this thread because the math
> alphanumerics include italics.  One of the web pages referring to
> third-party input tools calls the practice “super cool Unicode text magic”.
>
>
Although not all devices can render such text. Many Android handsets on the
market do not have a sufficiently recent version of Android to have system
fonts that can render such existing usage.




-- 
Andrew Cunningham
lang.supp...@gmail.com

Re: Encoding italic

2019-01-31 Thread David Starner via Unicode

On Wed, Jan 30, 2019 at 11:37 PM James Kass via Unicode
 wrote:
> As Tex Texin observed, differences of opinion as to where we draw the
> line between text and mark-up are somewhat ideological.  If a compelling
> case for handling italics at the plain-text level can be made, then the
> fact that italics can already be handled elsewhere doesn’t matter.  If a
> compelling case cannot be made, there are always alternatives.

To the extent I'd have ideology here, it's that that line is arbitrary
and needs to fit practical demands. Should we have eight-bit bytes?
I'm not sure that was the best solution, and other systems worked just
fine, but we've got a computing environment that makes anything else
unpractical. Unlike that question, italics has never been considered
part of plain text and has always been considered outside of plain
text. The fact that italics can be handled elsewhere very much weighs
against the value of your change. Everything you want to do can be
done and is being done, except when someone chooses not to do it.

-- 
Kie ekzistas vivo, ekzistas espero.

RE: Encoding italic

2019-01-30 Thread Tex via Unicode

David, Asmus,
 
·   “without external standards, then it's simply impossible.”

·   “And without external standard, not interoperable.“

As you both know there are de jure as well as de facto standards. So for years 
people typed :-) as a smiley without a de facto standard and at some point 
long before emoji, systems began converting these to smiley faces.

Even the UTF-8 BOM began as one company’s non-interoperable convention for 
encoding identification which later became part of the de facto standard.
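
For readers who have not looked at it closely, the UTF-8 BOM is just U+FEFF encoded as the three bytes EF BB BF at the start of the data. The following Python sketch shows how a decoder that knows the convention drops it, while a plain UTF-8 decoder passes it through as a character; the sample word is arbitrary.

```python
import codecs

# A UTF-8 byte stream that starts with the BOM signature.
data = codecs.BOM_UTF8 + "f\u00EAte".encode("utf-8")

print(data[:3].hex())  # the three signature bytes

plain = data.decode("utf-8")        # BOM survives as U+FEFF
aware = data.decode("utf-8-sig")    # BOM is recognised and removed

print(repr(plain))
print(repr(aware))
```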

Ideally interoperability means supported everywhere but we have many useful 
mechanisms that simply don’t do harm without being interpreted.

For example, Unicode relies on this for backward compatibility when it 
introduces new characters, properties, algorithms, et al that are not 
understood by all systems but are tolerated by older ones.

=

While I am at it, I am amused by the arguments earlier in this thread as well 
as other threads, that go:

·   If the feature was needed developers would have implemented it by now. 
It isn’t implemented so the standard doesn’t need it.

·   The feature was implemented without the standard, so we don’t need it 
in the standard.

If men were meant to fly they would have wings…

Apparently, for some, it is only when there are many conflicting 
implementations that a feature demonstrates both that it is a requirement and 
also that it should be standardized.

In fact, this is sometimes not a bad view as it prevents adding features to the 
standard that go unused yet add complexity. 

But, it can also set too high a bar. And often it isn’t a true criterion but 
just resistance to change.

You  don’t need italics. When I went to school we just tilted the terminal a 
few degrees and voila.

(You don’t need a car. When I went to school we walked 6 miles to get there. 
Uphill both ways. :-) )

tex

 

 

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Wednesday, January 30, 2019 10:20 PM
To: unicode@unicode.org
Subject: Re: Encoding italic

 

On 1/30/2019 7:46 PM, David Starner via Unicode wrote:

> On Sun, Jan 27, 2019 at 12:04 PM James Kass via Unicode wrote:
>
>> A new beta of BabelPad has been released which enables input, storing,
>> and display of italics, bold, strikethrough, and underline in plain-text
>
> Okay? Ed can do that too, along with nano and notepad. It's called
> HTML (TeX, Troff). If by plain-text, you mean self-interpreting,
> without external standards, then it's simply impossible.

It's either "markdown" or control/tag sequences. Both are out of band 
information.

And without external standard, not interoperable.

A./

Re: Encoding italic

2019-01-30 Thread James Kass via Unicode




David Starner wrote,

>> ... italics, bold, strikethrough, and underline in plain-text
>
> Okay? Ed can do that too, along with nano and notepad. It's called
> HTML (TeX, Troff). If by plain-text, you mean self-interpreting,
> without external standards, then it's simply impossible.

HTML source files are in plain-text.  Hopefully everyone on this list 
understands that and has already explored the marvelous benefits offered 
by granting users the ability to make exciting and effective page 
layouts via any plain-text editor.  HTML is standard and interchangeable.


As Tex Texin observed, differences of opinion as to where we draw the 
line between text and mark-up are somewhat ideological.  If a compelling 
case for handling italics at the plain-text level can be made, then the 
fact that italics can already be handled elsewhere doesn’t matter.  If a 
compelling case cannot be made, there are always alternatives.


As for use of other variant letter forms enabled by the math 
alphanumerics, the situation exists.  It’s an interesting phenomenon 
which is sometimes worthy of comment and relates to this thread because 
the math alphanumerics include italics.  One of the web pages referring 
to third-party input tools calls the practice “super cool Unicode text 
magic”.
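
For concreteness, the conversion such third-party tools perform can be sketched roughly as follows. This is a guess at the general technique, not a description of any particular tool; the function name to_math_italic is invented here. The one irregularity is the italic small h, which is not in the Mathematical Alphanumeric Symbols block and is represented by U+210E instead.

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A


def to_math_italic(text: str) -> str:
    """Map ASCII letters to mathematical italic letters; leave everything else alone."""
    out = []
    for ch in text:
        if ch == "h":
            out.append("\u210E")  # PLANCK CONSTANT stands in for italic small h
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        else:
            out.append(ch)  # digits, punctuation, accented letters are untouched
    return "".join(out)


print(to_math_italic("Unicode text magic"))
print(unicodedata.name(to_math_italic("a")))
```

Only the 52 basic Latin letters are covered, so a word such as "fête" comes out with an upright "ê" in the middle, which is one of the practical limits of this approach.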

Re: Encoding italic

2019-01-30 Thread Asmus Freytag via Unicode


  
  
On 1/30/2019 7:46 PM, David Starner via Unicode wrote:

> On Sun, Jan 27, 2019 at 12:04 PM James Kass via Unicode wrote:
>
>> A new beta of BabelPad has been released which enables input, storing,
>> and display of italics, bold, strikethrough, and underline in plain-text
>
> Okay? Ed can do that too, along with nano and notepad. It's called
> HTML (TeX, Troff). If by plain-text, you mean self-interpreting,
> without external standards, then it's simply impossible.

It's either "markdown" or control/tag sequences. Both are out of band
information.

And without external standard, not interoperable.

A./

Re: Encoding italic

2019-01-30 Thread David Starner via Unicode

On Sun, Jan 27, 2019 at 12:04 PM James Kass via Unicode
 wrote:
> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text

Okay? Ed can do that too, along with nano and notepad. It's called
HTML (TeX, Troff). If by plain-text, you mean self-interpreting,
without external standards, then it's simply impossible.

-- 
Kie ekzistas vivo, ekzistas espero.

Re: Encoding italic

2019-01-30 Thread Asmus Freytag via Unicode


  
  
On 1/30/2019 4:38 PM, Kent Karlsson via Unicode wrote:

> I did say "multiple" and "for instance". But since you ask:
>
> ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
> sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
> is implemented in Cygwin (sorry for mentioning a product name).)

No need to be sorry; we understand that the motivation is not so
much advertising as giving a concrete example. It would be
interesting if anything out there implements CMY(K). My
expectation would be that this would be limited to interfaces for
printers or their emulators.

> (The "named" ones, though very popular in terminal emulators, are
> all much too stark, I think, and the exact colour for them are
> implementation defined.)

Muted colors are something that's become more popular as display
hardware has improved. Modern displays are able to reproduce these
both more predictably and with the necessary degree of
contrast (although some users'/designers' fetish for low contrast
text design is almost as bad as people randomly mixing "stark"
FG/BG colors in the '90s).

> ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
> traditionally does not use bold or italic. Compare those specified for CSS
> (https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
> https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
> These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
> be of interest for the generalised subject of this thread.

Mapping all of these to CSS would be essential if you want this
stuff to be interoperable.

> There are some other differences as well, but those are the major ones
> with regard to text styling. (I don't know those standards to a tee.
> I've just looked at the "m" control sequences for text styling. And yes,
> I looked at the free copies...)
>
> /Kent Karlsson
>
> PS
> If people insist on that EACH character in "plain text" italic/bold/etc
> "controls" be default ignorable: one could just take the control sequences
> as specified, but map the printable characters part to the corresponding
> tag characters... Not that I think that that is really necessary.

Systems that support "markdown", i.e. simplified markup to
provide the most main-stream features of rich-text, tend to do that
with printable characters, for a reason. Perhaps two reasons.

Users find it preferable to have a visible fallback when
"markdown" is not interpreted by a receiving system, and users
generally like the ability to edit the markdown directly (even if,
for convenience, there's some direct UI support for adding text
styling).

Loading up the text with lots of invisible characters that may be
deleted or copied out of order by someone working on a system that
neither interprets nor displays these code points is an
interoperability nightmare in my opinion.

> On 2019-01-30 22:24, "Doug Ewell via Unicode" wrote:
>
>> Kent Karlsson wrote:
>>
>>> Yes, great. But as I've said, we've ALREADY got a
>>> default-ignorable-in-display (if implemented right)
>>> way of doing such things.
>>>
>>> And not only do we already have one, but it is also
>>> standardised in multiple standards from different
>>> standards institutions. See for instance "ISO/IEC 8613-6,
>>> Information technology --- Open Document Architecture (ODA)
>>> and Interchange Format: Character content architecture".
>>
>> I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
>> has the advantage of not costing me USD 179, and it looks very similar
>> to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
>> are talking about: setting text display properties such as bold and
>> italics by means of escape sequences.
>>
>> Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
>> doing, and if it does not, why we should not simply refer to the more
>> familiar 6429?
>>
>> --
>> Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-30 Thread Kent Karlsson via Unicode

I did say "multiple" and "for instance". But since you ask:

ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
is implemented in Cygwin (sorry for mentioning a product name).)
(The "named" ones, though very popular in terminal emulators, are
all much too stark, I think, and the exact colour for them are
implementation defined.)

ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
traditionally does not use bold or italic. Compare those specified for CSS
(https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
be of interest for the generalised subject of this thread.

There are some other differences as well, but those are the major ones
with regard to text styling. (I don't know those standards to a tee.
I've just looked at the "m" control sequences for text styling. And yes,
I looked at the free copies...)

/Kent Karlsson

PS
If people insist on that EACH character in "plain text" italic/bold/etc
"controls" be default ignorable: one could just take the control sequences
as specified, but map the printable characters part to the corresponding
tag characters... Not that I think that that is really necessary.
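
The idea in the PS can be sketched as follows. This is only one possible reading of it: the tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E, so the printable part of a control sequence such as "[3m" can be shifted up by 0xE0000. What, if anything, would stand in for the leading ESC is not specified above, so this sketch simply drops it; the names to_tags, strip_tags and italic_with_tags are invented for the example.

```python
import unicodedata

TAG_OFFSET = 0xE0000  # tag characters mirror ASCII at U+E0020..U+E007E


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII to the corresponding Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in ascii_text):
        raise ValueError("only printable ASCII can be mapped to tag characters")
    return "".join(chr(TAG_OFFSET + ord(c)) for c in ascii_text)


def strip_tags(text: str) -> str:
    """What a receiver that ignores tag characters would effectively see."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


def italic_with_tags(text: str) -> str:
    # Assumption: "[3m" / "[23m" are the printable parts of the SGR
    # sequences for italic on / italic off, with ESC omitted.
    return to_tags("[3m") + text + to_tags("[23m")


styled = italic_with_tags("f\u00EAte")
print(len(styled), repr(strip_tags(styled)))
print(unicodedata.name(to_tags("m")))
```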

On 2019-01-30 22:24, "Doug Ewell via Unicode" wrote:

> Kent Karlsson wrote:
>  
>> Yes, great. But as I've said, we've ALREADY got a
>> default-ignorable-in-display (if implemented right)
>> way of doing such things.
>> 
>> And not only do we already have one, but it is also
>> standardised in multiple standards from different
>> standards institutions. See for instance "ISO/IEC 8613-6,
>> Information technology --- Open Document Architecture (ODA)
>> and Interchange Format: Character content architecture".
>  
> I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
> has the advantage of not costing me USD 179, and it looks very similar
> to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
> are talking about: setting text display properties such as bold and
> italics by means of escape sequences.
>  
> Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
> doing, and if it does not, why we should not simply refer to the more
> familiar 6429?
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>

Re: Encoding italic

2019-01-30 Thread Doug Ewell via Unicode

Kent Karlsson wrote:

> Yes, great. But as I've said, we've ALREADY got a
> default-ignorable-in-display (if implemented right)
> way of doing such things.
>
> And not only do we already have one, but it is also
> standardised in multiple standards from different
> standards institutions. See for instance "ISO/IEC 8613-6,
> Information technology --- Open Document Architecture (ODA)
> and Interchange Format: Character content architecture".

I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
has the advantage of not costing me USD 179, and it looks very similar
to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
are talking about: setting text display properties such as bold and
italics by means of escape sequences.
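
As a reminder of what these escape sequences look like in practice, here is a small sketch using the ECMA-48 SGR ("m") control sequence: parameter 1 selects bold, 3 italic, 22 and 23 switch them off again. The helper names are made up for this example, and whether italic is actually rendered depends on the terminal emulator.

```python
ESC = "\x1b"


def sgr(*params: int) -> str:
    """Build an ECMA-48 SELECT GRAPHIC RENDITION control sequence."""
    return f"{ESC}[{';'.join(str(p) for p in params)}m"


def bold(text: str) -> str:
    return sgr(1) + text + sgr(22)   # 22: normal intensity


def italic(text: str) -> str:
    return sgr(3) + text + sgr(23)   # 23: not italicized


print("This is " + italic("italic") + " and this is " + bold("bold") + ".")
```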

Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
doing, and if it does not, why we should not simply refer to the more
familiar 6429?

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-30 Thread Doug Ewell via Unicode

Martin J. Dürst wrote:

> Here's a little dirty secret about these tag characters: They were
> placed in one of the astral planes explicitly to make sure they'd use
> 4 bytes per tag character, and thus quite a few bytes for any actual
> complete tags.

Aha. That explains why SCSU had to be banished to the hut, right around
the same time the Plane 14 language tags were deprecated. In SCSU,
astral characters can be 1 byte just like BMP characters.
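
The size cost mentioned in the quoted text is easy to verify: any Plane 14 tag character takes four bytes in both UTF-8 and UTF-16, against one byte in UTF-8 for its ASCII counterpart. The short sketch below measures this for a three-character tag; the sample tag is arbitrary.

```python
ascii_tag = "<i>"
# The same three characters shifted into the Plane 14 tag block.
plane14_tag = "".join(chr(0xE0000 + ord(c)) for c in ascii_tag)

sizes = {
    "ascii utf-8": len(ascii_tag.encode("utf-8")),
    "tags utf-8": len(plane14_tag.encode("utf-8")),
    "tags utf-16": len(plane14_tag.encode("utf-16-le")),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```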

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-29 Thread Kent Karlsson via Unicode

Yes, great. But as I've said, we've ALREADY got a
default-ignorable-in-display (if implemented right)
way of doing such things.

And not only do we already have one, but it is also
standardised in multiple standards from different
standards institutions. See for instance "ISO/IEC 8613-6,
Information technology --- Open Document Architecture (ODA)
and Interchange Format: Character content architecture".
(In a little experiment I found that it seems that
Cygwin is one of the better implementations of this;
B.t.w. I have no relation to Cygwin other than using it.)

To boot, it's been around for decades and is still
alive and well. I see absolutely no need for a "bold"
new concept here; the one below is not better in any
significant way.

/Kent Karlsson

On 2019-01-29 23:35, "Andrew West via Unicode" wrote:

> On Mon, 28 Jan 2019 at 01:55, James Kass via Unicode
>  wrote:
>> 
>> This <b>bold</b> new concept was not mine.  When I tested it
>> here, I was using the tag encoding recommended by the developer.
> 
> Congratulations James, you've successfully interchanged tag-styled
> plain text over the internet with no adverse side effects. I copied
> your email into BabelPad and your "bold" is shown bold (see attached
> screenshot).
> 
> Andrew

Re: Encoding italic

2019-01-29 Thread Andrew West via Unicode

On Mon, 28 Jan 2019 at 01:55, James Kass via Unicode
 wrote:
>
> This <b>bold</b> new concept was not mine.  When I tested it
> here, I was using the tag encoding recommended by the developer.

Congratulations James, you've successfully interchanged tag-styled
plain text over the internet with no adverse side effects. I copied
your email into BabelPad and your "bold" is shown bold (see attached
screenshot).

Andrew

Re: Encoding italic

2019-01-29 Thread James Kass via Unicode




Doug Ewell wrote,

> I can't speak for Andrew, but I strongly suspect he implemented this as
> a proof of concept, not to declare himself the Maker of Standards.

BabelPad also offers plain-text styling via math-alpha conversion, 
although this feature isn’t newly added.  Users interested in seeing how 
plain-text italics might work can try out the stateful approach using 
tags contrasted with the character-by-character approach using 
math-range italic letters.  (Of course, the math-range stuff is already 
being interchanged on the WWW, whilst the tagging method does not yet 
appear to be widely supported.)
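
A minimal sketch of the character-by-character approach, assuming plain ASCII letters as input. The function names are made up for this example and this is not BabelPad's code. It relies on two facts about the Mathematical Alphanumeric Symbols block: italic capitals start at U+1D434 and italic small letters at U+1D44E, with the slot for italic small h (U+1D455) reserved because U+210E PLANCK CONSTANT already serves that role.

```python
import unicodedata

def to_math_italic(text: str) -> str:
    """Replace ASCII letters with mathematical italic letters (illustrative)."""
    out = []
    for ch in text:
        if ch == "h":
            # U+1D455 is reserved; U+210E PLANCK CONSTANT is the italic h.
            out.append("\u210E")
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            # Anything outside A-Z / a-z has no italic counterpart here.
            out.append(ch)
    return "".join(out)

def from_math_italic(text: str) -> str:
    """Fold math letters back to ASCII via NFKC.

    NFKC also folds other compatibility characters, so this is lossy in general.
    """
    return unicodedata.normalize("NFKC", text)

print(to_math_italic("italic"))
print(from_math_italic(to_math_italic("The hat")))
```

Each styled letter is a separate code point, so no state has to be tracked, but letters outside basic Latin pass through unstyled.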


A few miles upthread, ‘where are the third-party developers’ was asked.  
‘Everywhere’ is the answer.  Since third-party developers have to 
subsist on the crumbs dropped by the large corps, they tend to be 
responsive to user needs and requests.

Re: Encoding italic

2019-01-29 Thread James Kass via Unicode




On 2019-01-29 5:10 PM, Doug Ewell via Unicode wrote:

> I thought we had established that someone had mentioned it on this list,
> at some time during the past three weeks. Can someone look up what post
> that was? I don't have time to go through scores of messages, and there
> is no search facility.

http://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0209.html

Re: Encoding italic

2019-01-29 Thread Andrew West via Unicode

On Tue, 29 Jan 2019 at 10:25, Martin J. Dürst via Unicode
 wrote:
>
> The overall tag proposal had the desired effect: The original proposal
> to hijack some unused bytes in UTF-8 was defeated, and the tags themselves
> were not actually used and therefore could be deprecated.

And the tag characters (all except E0001) are now no longer
deprecated. As flag tag sequences are now a thing
(http://www.unicode.org/reports/tr51/#valid-emoji-tag-sequences), and
are widely supported (including on Twitter), your and PV's objections
to using tag characters for a plain text font styling protocol simply
because they are tag characters carry zero weight.
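
As a sketch of what a flag tag sequence looks like at the code point level: U+1F3F4 WAVING BLACK FLAG, followed by the tag-character counterparts of a lowercase subdivision code, followed by U+E007F CANCEL TAG. The function name is invented for this example, and the sketch does not check whether the code names a subdivision that vendors actually support.

```python
BLACK_FLAG = "\U0001F3F4"   # WAVING BLACK FLAG, the base of the sequence
CANCEL_TAG = "\U000E007F"   # CANCEL TAG terminates the sequence
TAG_OFFSET = 0xE0000        # the Tags block mirrors ASCII at this offset

_ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789")

def subdivision_flag(code: str) -> str:
    """Build an emoji tag sequence for a subdivision code such as 'gbsct'."""
    if not code or any(ch not in _ALLOWED for ch in code):
        raise ValueError("expected lowercase ASCII letters and digits only")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in code)
    return BLACK_FLAG + tags + CANCEL_TAG

scotland = subdivision_flag("gbsct")
print([f"U+{ord(ch):04X}" for ch in scotland])
```

A renderer that does not know the sequence can ignore the tag characters and show only the black flag.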

Andrew

Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode

Martin J. Dürst wrote:

> Here's a little dirty secret about these tag characters: They were
> placed in one of the astral planes explicitly to make sure they'd use
> 4 bytes per tag character, and thus quite a few bytes for any actual
> complete tags. See https://tools.ietf.org/html/rfc2482 for details.
> Note that RFC 2482 has been obsoleted by
> https://tools.ietf.org/html/rfc6082, in parallel with a similar motion
> on the Unicode side.

I don't recall anyone mentioning Plane 14 language tags per se in this
thread. The tag characters themselves were un-deprecated to support
emoji flag sequences. But more on language tags in a moment.

> These tag characters were born only to shoot down an even worse
> proposal, https://tools.ietf.org/html/draft-ietf-acap-mlsf-01. For
> some additional background, please see
> https://tools.ietf.org/html/draft-ietf-acap-langtag-00.
>
> The overall tag proposal had the desired effect: The original proposal
> to hijack some unused bytes in UTF-8 was defeated, and the tags themselves
> were not actually used and therefore could be deprecated.

I agree that the ACAP proposal was awful, for many reasons and on many
levels. But in general, introducing a new standardized mechanism SO THAT
it can be deprecated is a crummy idea. It engenders bad feelings and
distrust among loyal users of the standard. Major software vendors, one
in particular starting with M, have been castigated for decades for
employing tactics similar to this.

> Bad ideas turn up once every 10 or 20 years. It usually takes some
> time for some of the people to realize that they are bad ideas. But
> that doesn't make them any better when they turn up again.

The suggestions over the past three weeks to encode basic styling in
plain text (I'm not saying I'm for or against that) have some
similarities with Plane 14 language tags: many people consider both
types of information to be meta-information, unsuitable for plain text,
and many of the suggested mechanisms are stateful, which is an anti-goal
of Unicode. But these are NOT the same idea, and the fact that they both
use Plane 14 tag characters doesn't make them so.
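
To make "stateful" concrete, here is a small sketch contrasting the two families of suggestions. It assumes an HTML-like open/close pair written in tag characters for the stateful case, and uses mathematical italic letters as a stand-in for any per-character mechanism. All names are invented for this illustration.

```python
TAG_OFFSET = 0xE0000  # the Tags block mirrors ASCII at this offset

def as_tags(markup: str) -> str:
    """Map ASCII markup such as '<i>' to invisible tag characters."""
    return "".join(chr(TAG_OFFSET + ord(c)) for c in markup)

OPEN_I, CLOSE_I = as_tags("<i>"), as_tags("</i>")

def italic_stateful(text: str) -> str:
    """Styling is switched on before the run and off after it."""
    return OPEN_I + text + CLOSE_I

def italic_stateless(text: str) -> str:
    """Every letter carries its own styling (a-z except h, for brevity)."""
    return "".join(
        chr(0x1D44E + ord(c) - ord("a")) if "a" <= c <= "z" and c != "h" else c
        for c in text
    )

stateful = "a " + italic_stateful("very long") + " word"
stateless = "a " + italic_stateless("very long") + " word"

# Cut each string so that it starts at the second styled word.
stateful_fragment = stateful[stateful.index("long"):]
stateless_fragment = stateless[stateless.index(italic_stateless("long")):]

print(OPEN_I in stateful_fragment)    # the opening state is gone
print(CLOSE_I in stateful_fragment)   # only the closing tag survives
print(stateless_fragment.startswith(italic_stateless("long")))
```

The stateful fragment no longer records that "long" was italic, whereas the per-character fragment still does; that loss on truncation is why statefulness counts as an anti-goal.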

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode

Kent Karlsson wrote:

> We already have a well-established standard for doing this kind of
> things...

I thought we were having this discussion because none of the existing
methods, no matter how well documented, has been accepted on a
widespread basis as "the" standard.

Some people dislike markdown because it looks like lightweight markup
(which it is), not like actual italics and boldface. Some dislike ISO
6429 because escape characters are invisible and might interfere with
other protocols (though they really shouldn't). Some dislike math
alphanumerics abuse because it's abuse, doesn't cover other writing
systems, etc.

I'd be happy to work with Kent to campaign for ISO 6429 as "the"
well-established standard for applying simple styling to plain text, but
we would have to acknowledge the significant challenges.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-29 Thread Doug Ewell via Unicode

Philippe Verdy replied to James Kass:
 
> You're not very explicit about the Tag encoding you use for these
> styles.
 
Of course, it was Andrew West who implemented the styling mechanism in a
beta release of BabelPad. James was just reporting on it.
 
> And what is then the interest compared to standard HTML
 
This entire discussion, for more than three weeks now, has been about
how to implement styling (e.g. italics) in plain text. Everyone knows it
can be done, and how to do it, in rich text.
 
> So you used "bold  U+E003E>
> I.e., you converted from ASCII to tag characters the full HTML
> sequences "<b>" and "</b>", including the HTML element name. I see
> little interest for that approach.
 
I thought we had established that someone had mentioned it on this list,
at some time during the past three weeks. Can someone look up what post
that was? I don't have time to go through scores of messages, and there
is no search facility.
 
I can't speak for Andrew, but I strongly suspect he implemented this as
a proof of concept, not to declare himself the Maker of Standards.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Encoding italic

2019-01-29 Thread Martin J . Dürst via Unicode

On 2019/01/28 05:03, James Kass via Unicode wrote:
> 
> A new beta of BabelPad has been released which enables input, storing, 
> and display of italics, bold, strikethrough, and underline in plain-text 
> using the tag characters method described earlier in this thread.  This 
> enhancement is described in the release notes linked on this download page:
> 
> http://www.babelstone.co.uk/Software/index.html
>

I didn't say anything at the time this idea first came up, because I 
hoped people would understand that it was a bad idea.

Here's a little dirty secret about these tag characters: They were 
placed in one of the astral planes explicitly to make sure they'd use 4 
bytes per tag character, and thus quite a few bytes for any actual 
complete tags. See https://tools.ietf.org/html/rfc2482 for details. Note 
that RFC 2482 has been obsoleted by https://tools.ietf.org/html/rfc6082, 
in parallel with a similar motion on the Unicode side.
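
A quick sketch of the byte cost being described, using nothing beyond the fact that the tag characters mirror ASCII at an offset of U+E0000. The helper name is made up for the illustration.

```python
TAG_OFFSET = 0xE0000  # the Tags block mirrors ASCII at this offset

def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

ascii_b = "b"
tag_b = chr(TAG_OFFSET + ord("b"))  # U+E0062 TAG LATIN SMALL LETTER B

print(utf8_len(ascii_b))                     # 1
print(utf8_len(tag_b))                       # 4
print(utf8_len("en-US"))                     # 5
print(utf8_len("".join(chr(TAG_OFFSET + ord(c)) for c in "en-US")))  # 20
```

Every code point above U+FFFF takes four bytes in UTF-8, so a complete tag costs four times its ASCII spelling.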

These tag characters were born only to shoot down an even worse 
proposal, https://tools.ietf.org/html/draft-ietf-acap-mlsf-01. For some 
additional background, please see 
https://tools.ietf.org/html/draft-ietf-acap-langtag-00.

The overall tag proposal had the desired effect: The original proposal
to hijack some unused bytes in UTF-8 was defeated, and the tags themselves
were not actually used and therefore could be deprecated.

Bad ideas turn up once every 10 or 20 years. It usually takes some time 
for some of the people to realize that they are bad ideas. But that 
doesn't make them any better when they turn up again.

Regards,   Martin.

Re: Encoding italic

2019-01-29 Thread Martin J . Dürst via Unicode

On 2019/01/24 23:49, Andrew West via Unicode wrote:
> On Thu, 24 Jan 2019 at 13:59, James Kass via Unicode
>  wrote:

> We were told time and time again when emoji were first proposed that
> they were required for encoding for interoperability with Japanese
> telecoms whose usage had spilled over to the internet. At that time
> there was no suggestion that encoding emoji was anything other than a
> one-off solution to a specific problem with PUA usage by different
> vendors, and I at least had no idea that emoji encoding would become a
> constant stream with an annual quota of 60+ fast-tracked
> user-suggested novelties. Maybe that was the hidden agenda, and I was
> just naïve.

I don't think this was a hidden agenda. Nobody in the US or Europe
thought that emoji would catch on like they did, with ordinary people
and the press. Of course they had been popular in Japan, that's why they
got into Unicode.

> The ESC and UTC do an appallingly bad job at regulating emoji, and I
> would like to see the Emoji Subcommittee disbanded, and decisions on
> new emoji taken away from the UTC, and handed over to a consortium or
> committee of vendors who would be given a dedicated vendor-use emoji
> plane to play with (kinda like a PUA plane with pre-assigned
> characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
> the vendors can then associate with glyphs as they see fit; and as
> emoji seem to evolve over time they would be free to modify and
> reassign glyphs as they like because the Unicode Standard would not
> define the meaning or glyph for any characters in this plane).

To a small extent, that already happens. The example I'm thinking about
is the transition from a (potentially bullet-carrying) pistol to a
water pistol. The Unicode Consortium doesn't define the meaning of any of
its characters, and doesn't define standard glyphs for characters, just
example glyphs. Another example is a presenter at a conference who was
using lots of emoji saying that he would need to redo his presentation
because the vendor of his notebook's OS was in the process of changing
their emoji designs.

Regards,Martin.

Re: Encoding italic

2019-01-28 Thread Phake Nick via Unicode

2019-1-25 13:46, Garth Wallace via Unicode  wrote:

>
> On Wed, Jan 23, 2019 at 1:27 AM James Kass via Unicode <
> unicode@unicode.org> wrote:
>
>>
>> Nobody has really addressed Andrew West's suggestion about using the tag
>> characters.
>>
>> It seems conformant, unobtrusive, requiring no official sanction, and
>> could be supported by third-partiers in the absence of corporate
>> interest if deemed desirable.
>>
>> One argument against it might be:  Whoa, that's just HTML.  Why not just
>> use HTML?  SMH
>>
>> One argument for it might be:  Whoa, that's just HTML!  Most everybody
>> already knows about HTML, so a simple subset of HTML would be
>> recognizable.
>>
>> After revisiting the concept, it does seem elegant and workable. It
>> would provide support for elements of writing in plain-text for anyone
>> desiring it, enabling essential (or frivolous) preservation of
>> editorial/authorial intentions in plain-text.
>>
>> Am I missing something?  (Please be kind if replying.)
>>
>
> There is also RFC 1896 "enriched text", which is an attempt at a
> lightweight HTML substitute for styling in email. But these, and the ANSI
> escape code suggestion, seem like they're trying to solve the wrong problem
> here.
>
> Here's how I understand the situation:
> * Some people using forms of text or mostly-text communication that do not
> provide styling features want to use styling, for emphasis or personal flair
> * Some of these people caught on to the existence of the "styled"
> mathematical alphanumerics and, not caring that this is "wrong", started
> using them as a workaround
> * The use of these symbols, which are not technically equivalent to basic
> Latin, make posts inaccessible to screen readers, among other problems
>
> These are suggestions for Unicode to provide a different, more
> "acceptable" workaround for a lack of functionality in these social media
> systems (this mostly seems to be an issue with Twitter; IME this shows up
> much less on Facebook). But the root problem isn't the kludge, it's the
> lack of functionality in these systems: if Twitter etc. simply implemented
> some styling on their own, the whole thing would be a moot point.
> Essentially, this is trying to add features to Twitter without waiting for
> their development team.
>
> Interoperability is not an issue, since in modern computers copying and
> pasting styled text between apps works just fine.
>

How about outside social media systems? For example, Chinese Braille has
symbols that indicate the start and end positions of the proper name mark
and the book title mark. When converted to plain text, however, these
cannot be represented in Unicode, because of the mindset that rendering
this punctuation should be the task of styling software, simply because in
ordinary Chinese text the two marks are a straight underline and a wavy
underline beneath the text.

>

Re: Encoding italic

2019-01-28 Thread Phake Nick via Unicode

Gmail can do *Märchen*, although I am not sure how they transmit such
formatting, nor how interoperable it is.

On Tue, 22 Jan 2019 at 14:43, Adam Borowski via Unicode wrote:

> On Mon, Jan 21, 2019 at 12:29:42AM -0800, David Starner via Unicode wrote:
> > On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode
> >  wrote:
> > >  Even though /we/ know how to do
> > > it and have software installed to help us do it.
> >
> > You're emailing from Gmail, which has support for italics in email.
>
> ... and how exactly can they send italics in an e-mail?  All they can do is
> to bundle a web page as an attachment, which some clients display instead
> of the main text.
>
> The e-mail's body text supports anything Unicode does, including
> 𝑖𝑡𝑎𝑙𝑖𝑐 and
> even ̏̋̃ ̉̀̋̉̂̕, but, remarkably, not italic umlauted characters,
> thai nor
> han.
>

Re: Encoding italic

2019-01-28 Thread Kent Karlsson via Unicode



On 2019-01-28 02:53, "James Kass via Unicode" wrote:

> plain-text and are uncomfortable using the math alphanumerics for this,
> although the math alphanumerics seem well qualified for the purpose. 

It "works" basically only for English. Note that any diacritics would be
placed suitably for math, not for words; then there are Latin letters
that do not have a decomposition (like ø); and then there is of course
Cyrillic, and a whole slew of non-Latin scripts. So, no, they do NOT AT
ALL "seem well qualified". And... We already have a well-established
standard for doing this kind of thing...
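
The point about decomposition can be checked with the standard library. This is only a sketch: the helper name is invented, and it tests just one thing, namely whether canonical decomposition (NFD) yields an ASCII base letter that a math-italic mapping could then replace.

```python
import unicodedata

def has_ascii_base(ch: str) -> bool:
    """True if NFD turns `ch` into an ASCII letter plus combining marks only."""
    nfd = unicodedata.normalize("NFD", ch)
    base, marks = nfd[0], nfd[1:]
    return (
        base.isascii()
        and base.isalpha()
        and all(unicodedata.combining(m) != 0 for m in marks)
    )

for ch in ["e", "\u00e9", "\u00f8", "\u0434"]:   # e, é, ø, Cyrillic д
    print(f"U+{ord(ch):04X}", has_ascii_base(ch))
```

é decomposes to e plus a combining acute, so an italic base letter could at least be attempted (with mark placement left to a math font), whereas ø and д offer no ASCII base at all.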

/Kent K

Re: Encoding italic

2019-01-28 Thread Philippe Verdy via Unicode

So you used
"bold 
I.e., you converted from ASCII to tag characters the full HTML sequences
"<b>" and "</b>", including the HTML element name. I see little interest
for that approach.

Additionally this means that U+E003C is the tag identifier and its scope
does not end for the rest of the text (the HTML close tag is closing the
previous Unicode tag but opens a new one, as the second sequence is not
U+E007F, i.e. the Unicode tag-cancel).

I bet that Unicode-conforming code that handles some tag characters could
choose to remove everything in a Unicode tag that it does not understand
(e.g. U+E003C is not an understood identifier, only U+E0001 is understood
as a language tag) or does not want to parse. But without the tag-cancel,
all the rest of your email could have been truncated, instead of just the
tagged text "bold".

Given how HTML tags nest (or do not), I don't think this approach
is desirable.

And I'm not sure that everyone on this list actually received your mail with
this tag; it may have happened that your mail was truncated or all U+E00nn
characters were silently removed by an intermediate agent not wanting to
support any Unicode Tag character.

On Mon, 28 Jan 2019 at 03:03, James Kass via Unicode wrote:

>
> On 2019-01-27 11:44 PM, Philippe Verdy wrote:
>
>  > You're not very explicit about the Tag encoding you use for these
> styles.
>
> This <b>bold</b> new concept was not mine.  When I tested it
> here, I was using the tag encoding recommended by the developer.
>
>  > Of course it must not be a language tag so the introducer is not
> U+E0001, or a cancel-all tag so it
>  > is not prefixed by U+E007F   It cannot also use letter-like,
> digit-like and hyphen-like tag characters
>  > for its introduction.  So probably you use some prefix in
> U+E0002..U+E001F and some additional tag
>  > (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S"
> for strikethrough?) and the cancel
>  > tag to return to normal text (terminate the tagged sequence).
>
> Yes, U+E0001 remains deprecated and its use is strongly discouraged.
>
>  > Or may be you just use standard HTML encoding by adding U+E to
> each character of the HTML
>  > tag syntax (including attributes and close tags, allowing embedding?)
> So you use the "<" and ">" tag
>  > characters (possibly also the space tag U+E0020, or TAB tag U+E0009
> for separating attributes and the
>  > quotation tags for attribute values)?  Is your proposal also allowing
> the embedding of other HTML
>  > objects (such as SVG)?
>
> AFAICT, this beta release supports the tag sequences , ,
> , &  expressed here in ASCII.  I don’t know if the
> software developer has plans to expand the enhancements in the future.
>
>  > And what is then the interest compared to standard HTML (it is not
> more compact, ...
>
> This was one of the ideas which surfaced earlier in this thread. Some
> users have expressed an interest in preserving, for example, italics in
> plain-text and are uncomfortable using the math alphanumerics for this,
> although the math alphanumerics seem well qualified for the purpose.
> One of the advantages given for this approach earlier is that it can be
> made to work without any official sanction and with no action necessary
> by the Consortium.
>
>  > I bet in fact that all tag characters are most often restricted in
> text input forms, and will be
>  > silently discarded or the whole text will be rejected.
>
> In this e-mail, I used the tags <b> & </b> around the word “bold” in the
> first sentence of my reply in order to test your bet.
>
>  > We were told that these tag characters were deprecated, and in fact
> even their use for language
>  > tags has not found any significant use except some trials (but there
> are now better technologies
>  > available in lot of softwares, APIs and services, and application
> design/development tools, or
>  > document editing/publishing tools).
>
> Indeed, these tags were deprecated.  At the time the tags were
> deprecated, there was such sorrow on this list that some list members
> were even inspired to compose haiku lamenting their passing and did post
> those haiku to this list.  Now, thanks to emoji requirements, many of
> those tags are experiencing a resurrection/renaissance.  I wonder if
> anyone is composing limericks in joyful celebration…
>
>

Re: Encoding italic

2019-01-27 Thread James Kass via Unicode

On 2019-01-27 11:44 PM, Philippe Verdy wrote:

> You're not very explicit about the Tag encoding you use for these styles.

This <b>bold</b> new concept was not mine.  When I tested it 
here, I was using the tag encoding recommended by the developer.

> Of course it must not be a language tag so the introducer is not
> U+E0001, or a cancel-all tag so it is not prefixed by U+E007F.  It
> cannot also use letter-like, digit-like and hyphen-like tag characters
> for its introduction.  So probably you use some prefix in
> U+E0002..U+E001F and some additional tag (tag "I" for italic, tag "B"
> for bold, tag "U" for underline, tag "S" for strikethrough?) and the
> cancel tag to return to normal text (terminate the tagged sequence).

Yes, U+E0001 remains deprecated and its use is strongly discouraged.

> Or may be you just use standard HTML encoding by adding U+E to each
> character of the HTML tag syntax (including attributes and close tags,
> allowing embedding?) So you use the "<" and ">" tag characters
> (possibly also the space tag U+E0020, or TAB tag U+E0009 for separating
> attributes and the quotation tags for attribute values)?  Is your
> proposal also allowing the embedding of other HTML objects (such as
> SVG)?

AFAICT, this beta release supports the tag sequences , , 
, &  expressed here in ASCII.  I don’t know if the 
software developer has plans to expand the enhancements in the future.
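
A minimal sketch of the tag-character method, under a loudly stated assumption: that each printable ASCII character of an HTML-like tag is mapped to its counterpart in the Tags block by adding 0xE0000, as the thread describes. This is not BabelPad's actual code, and every function name below is hypothetical.

```python
TAG_OFFSET = 0xE0000                     # Tags block mirrors ASCII here
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # counterparts of printable ASCII

def to_tag_chars(markup: str) -> str:
    """Map printable ASCII markup to invisible tag characters."""
    out = []
    for ch in markup:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(ord(ch) + TAG_OFFSET))
    return "".join(out)

def style(text: str, element: str) -> str:
    """Wrap visible `text` in an assumed open/close pair such as <b>...</b>."""
    return to_tag_chars(f"<{element}>") + text + to_tag_chars(f"</{element}>")

def strip_tag_chars(text: str) -> str:
    """What a process that ignores tag characters would effectively display."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

def reveal(text: str) -> str:
    """Debug view: show tag characters as their ASCII counterparts."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET) if TAG_FIRST <= ord(ch) <= TAG_LAST else ch
        for ch in text
    )

styled = style("bold", "b")
print(strip_tag_chars(styled))   # bold
print(reveal(styled))            # <b>bold</b>
```

Software unaware of the convention sees only the word itself, which is the "unobtrusive" property claimed for the approach earlier in the thread.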

> And what is then the interest compared to standard HTML (it is not
> more compact, ...

This was one of the ideas which surfaced earlier in this thread. Some 
users have expressed an interest in preserving, for example, italics in 
plain-text and are uncomfortable using the math alphanumerics for this, 
although the math alphanumerics seem well qualified for the purpose.  
One of the advantages given for this approach earlier is that it can be 
made to work without any official sanction and with no action necessary 
by the Consortium.

> I bet in fact that all tag characters are most often restricted in
> text input forms, and will be silently discarded or the whole text
> will be rejected.

In this e-mail, I used the tags <b> & </b> around the word “bold” in the
first sentence of my reply in order to test your bet.

> We were told that these tag characters were deprecated, and in fact
> even their use for language tags has not found any significant use
> except some trials (but there are now better technologies available in
> lot of softwares, APIs and services, and application design/development
> tools, or document editing/publishing tools).

Indeed, these tags were deprecated.  At the time the tags were 
deprecated, there was such sorrow on this list that some list members 
were even inspired to compose haiku lamenting their passing and did post 
those haiku to this list.  Now, thanks to emoji requirements, many of 
those tags are experiencing a resurrection/renaissance.  I wonder if 
anyone is composing limericks in joyful celebration…

Re: Encoding italic

2019-01-27 Thread Kent Karlsson via Unicode

Apart from the fact that control sequences for (some) styling have been
standardised (for decades by now), and the "tag characters" approach has not:

For the control sequences for styling, there is no pretence of nesting,
just setting/unsetting an aspect of styling. For <b> etc. (in tag
characters) there is at least the pretence/appearance of nesting, even
if the interpreter doesn't actually care about nesting (and just interprets
them as set/unset). (In addition, <b> etc. in "real" HTML are
1) disrecommended, and
2) the actual styling comes from a style sheet (and the **default**
one makes <b> stuff bold).)

/Kent K


On 2019-01-27 21:03, "James Kass via Unicode" wrote:

> 
> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text
> using the tag characters method described earlier in this thread.  This
> enhancement is described in the release notes linked on this download page:
> 
> http://www.babelstone.co.uk/Software/index.html
>

Re: Encoding italic

2019-01-27 Thread Philippe Verdy via Unicode

You're not very explicit about the Tag encoding you use for these styles.

Of course it must not be a language tag so the introducer is not U+E0001,
or a cancel-all tag so it is not prefixed by U+E007F.
Nor can it use letter-like, digit-like and hyphen-like tag characters
for its introduction.
So probably you use some prefix in U+E0002..U+E001F and some additional tag
(tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for
strikethrough?) and the cancel tag to return to normal text (terminate the
tagged sequence).

Or may be you just use standard HTML encoding by adding U+E to each
character of the HTML tag syntax (including attributes and close tags,
allowing embedding?) So you use the "<" and ">" tag characters (possibly
also the space tag U+E0020, or TAB tag U+E0009 for separating attributes
and the quotation tags for attribute values)?
Is your proposal also allowing the embedding of other HTML objects (such as
SVG)?

In that case what you do is only to remap the HTML syntax outside the
standard text. If an attribute value contains standard text (such as ...)
do you also remap the attribute value, i.e. "Some text"? Do you remap the
technical name of the HTML tag itself, i.e. "span" in the last example?

And what is then the interest compared to standard HTML (it is not more
compact, and just adds another layer on top of it), except allowing to
embed it in places where plain HTML would be restricted by form inputs or
would be reconverted using character entities hiding the effect of "<", ">"
and "&" in HTML so they are not reinterpreted as HTML but as plain-text
characters?

Now let's suppose that your convention starts being decoded and used in
some applications, this could be used to transport sensitive active scripts
(e.g. Javascript event handlers or plain

Re: Encoding italic

2019-01-27 Thread James Kass via Unicode




A new beta of BabelPad has been released which enables input, storing, 
and display of italics, bold, strikethrough, and underline in plain-text 
using the tag characters method described earlier in this thread.  This 
enhancement is described in the release notes linked on this download page:


http://www.babelstone.co.uk/Software/index.html

Re: Encoding italic

2019-01-25 Thread James Kass via Unicode




On 2019-01-26 12:18 AM, Asmus Freytag (c) responded:

> On 1/25/2019 3:49 PM, Andrew Cunningham wrote:
>> Assuming some mechanism for italics is added to Unicode, when
>> converting between the new plain text and HTML there is insufficient
>> information to correctly convert to HTML. Many elements may have
>> italic styling and there would be no meta information in Unicode to
>> indicate the appropriate HTML element.
>
> So, we would be creating an interoperability issue.

What happens now when we convert plain-text to HTML?

Re: Encoding italic

2019-01-25 Thread Asmus Freytag (c) via Unicode


On 1/25/2019 3:49 PM, Andrew Cunningham wrote:
> Assuming some mechanism for italics is added to Unicode, when
> converting between the new plain text and HTML there is insufficient
> information to correctly convert to HTML. Many elements may have
> italic styling and there would be no meta information in Unicode to
> indicate the appropriate HTML element.

So, we would be creating an interoperability issue.

A./





On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode wrote:


Asmus Freytag wrote;

Other schemes, like a VS per code point, also suffer from
being different in philosophy from "standard" rich text
approaches. Best would be as standard extension to all the
messaging systems (e.g. a common markdown language, supported
by UI).     A./


Yet that claim of what would be best would be stateful and
statefulness is the very thing that Unicode seeks to avoid.

Plain text is the basic system and a Variation Selector mechanism
after each character that is to become italicized is not stateful
and can be implemented using existing OpenType technology.

If an organization chooses to develop and use a rich text format
then that is a matter for that organization and any changing of
formatting of how italics are done when converting between plain
text and rich text is the responsibility of the organization that
introduces its rich text format.

Twitter was just an example that someone introduced along the way,
it was not the original request.

Also this is not only about messaging. Of primary importance is
the conservation of texts in plain text format, for example, where
a printed book has one word italicized in a sentence and the text
is being transcribed into a computer.

William Overington
Friday 25 January 2019



--
Andrew Cunningham
lang.supp...@gmail.com

Re: Encoding italic

2019-01-25 Thread Andrew Cunningham via Unicode

Assuming some mechanism for italics is added to Unicode, when converting
between the new plain text and HTML there is insufficient information to
correctly convert to HTML. Many elements may have italic styling and there
would be no meta information in Unicode to indicate the appropriate HTML
element.




On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> wrote:

> Asmus Freytag wrote;
>
> Other schemes, like a VS per code point, also suffer from being different
>> in philosophy from "standard" rich text approaches. Best would be as
>> standard extension to all the messaging systems (e.g. a common markdown
>> language, supported by UI). A./
>>
>
> Yet that claim of what would be best would be stateful and statefulness is
> the very thing that Unicode seeks to avoid.
>
> Plain text is the basic system and a Variation Selector mechanism after
> each character that is to become italicized is not stateful and can be
> implemented using existing OpenType technology.
>
> If an organization chooses to develop and use a rich text format then that
> is a matter for that organization and any changing of formatting of how
> italics are done when converting between plain text and rich text is the
> responsibility of the organization that introduces its rich text format.
>
> Twitter was just an example that someone introduced along the way, it was
> not the original request.
>
> Also this is not only about messaging. Of primary importance is the
> conservation of texts in plain text format, for example, where a printed
> book has one word italicized in a sentence and the text is being
> transcribed into a computer.
>
> William Overington
> Friday 25 January 2019
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com
