Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Mark E. Shoulson wrote,

> This discussion has been very interesting, really.  I've heard what I
> thought were very good points and relevant arguments from both/all
> sides, and I confess to not being sure which I actually prefer.

It's subjective, really.  It depends on how one views plain-text and 
one's expectations for its future.  Should plain-text be progressive, 
regressive, or stagnant?  Because those are really the only choices.  
And opinions differ.


Most of us involved with Unicode probably expect plain-text to be around 
for quite a while.  The figure bandied about in the past on this list is 
"a thousand years".  Only a society of mindless drones would cling to 
the past for a millennium.  So, many of us probably figure that 
strictures laid down now will be overridden as a matter of course, over 
time.


Unicode will probably be around for a while, but the barrier between 
plain- and rich-text has already morphed significantly in the relatively 
short period of time it's been around.


I became attracted to Unicode about twenty years ago because it opened 
up entire /realms/ of new vistas relating to what could be done with 
computer plain text.  I hope this trend continues.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



On 2019-01-12 4:26 PM, wjgo_10...@btinternet.com wrote:

> I have now made, tested and published a font, VS14 Maquette, that uses
> VS14 to indicate italic.


https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561



The italics don't happen in Notepad, but VS14 Maquette works splendidly 
in LibreOffice!  (Windows 7, in a *.TXT file)


Since the VS characters are supposed to be used with officially 
registered/recognized sequences, it's possible that Notepad isn't trying 
to implement the feature.
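For concreteness, the convention under discussion can be sketched in a few lines of Python. This is purely illustrative: using VS14 (U+FE0D) to mean "italic" is not a registered variation sequence, so the result is non-conformant plain text, but VS-unaware renderers should simply ignore the selectors.

```python
# Illustrative sketch of the VS14-as-italic convention: append U+FE0D
# VARIATION SELECTOR-14 to each base letter so that a cooperating font
# (such as the VS14 Maquette experiment) can substitute italic glyphs.
# NOTE: this is NOT a registered variation sequence.
VS14 = "\uFE0D"

def tag_italic(text: str) -> str:
    """Append VS14 to every character of `text`."""
    return "".join(ch + VS14 for ch in text)

tagged = tag_italic("italic")
assert tagged == "i\uFE0Dt\uFE0Da\uFE0Dl\uFE0Di\uFE0Dc\uFE0D"
```

A renderer that does not know the convention shows "italic" unchanged, which is exactly the graceful-degradation behavior described above.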


The official reception of the notion of using variant letter forms, such 
as italics, in plain-text is typically frosty.  So advancement of 
plain-text might be left up to third-party developers, enthusiasts, and 
the actual text users.  And there's nothing wrong with that.  (It's 
non-conformant, though, unless the VS material is officially 
recognized/registered.)


Non-Latin scripts, such as Khmer, may have their own traditions and 
conventions WRT special letter forms.  Which is why starting at VS14 and 
working backwards might be inadequate in the long run.


Khmer has letter forms called muul/moul/muol (not sure how to spell that 
one, but neither is anybody else).  It superficially resembles fraktur 
for Khmer.  Other non-Latin scripts may have a plethora of such 
forms/fonts/styles.




Re: A last missing link for interoperable representation

2019-01-12 Thread Mark E. Shoulson via Unicode
Just to add some more fuel for this fire, I note also the highly popular 
(in some places) technique of using Unicode letters that may have 
nothing whatsoever to do with the symbol or letter you mean to 
represent, apart from coincidental resemblance and looking "cool" 
enough.  This happens a lot on Second Life, where you can set your 
"display name" distinct from your "user name", but the display name 
appears to be limited to Unicode *letters* and some punctuation, mostly, 
and certainly can't be outside the BMP.  So for a sampling from stuff 
I've heard of...


ΑbiΑИØ SŦээlSØul
ΛPΉӨD
ΛИƓĿƐĪƇ  Ɗє ℓα ℜudǝ ωђitmαη
ΛЯℂӨƧ BΛПDΣЯΛƧ
ღLɪɴᴅᴀღ
ђÅℵℵƔ Fashionablez ℬãŋќş Ķhaгg
єσηα MιяєƖуηη
ℒυςノσυʂ ツ .
乙u 乙u
尺αмση ℓυιѕ αуαℓα
mღn
ᄊムレo
Ɩ'M ŦЯØЦßĿЄ ƧЄƝƖȤЄƝ ƓƠƬƊƛMMƖƬ™
øקςøги вαℓℓѕ ßⱥţţïţuđє
Ąşђεгöη ĄĶЯĨ Ğrєץ
Đ尺ѦႺΘȠ
đ σ  ℜ ι ค ℵ :.
ĦΔZΔRĐ
ʕ·ᴥ·ʔ
ϮJΩƧӇƲΔϮ
ϯcH ℭℛℯȡĩȵŧă
ⓁợⒼαℵ
亗 Amy 亗
ßяуⒶℵ GяєуωσLƒ
тαקקαt Wuηđǝяレǝ
کhäşhι ℰղcαηϯäɖσƦ
ۣღۣۜ Jarah Sparksۣღۣۜ
ઇઉ fleur ઇઉ
໓яαкє ςнυяςн
ڰۣღ- Pandora Barbarosڰۣღ-
ஐ tenayah ஐ-x-
ღⒹムяк 丂σuℓ™ღ
ץlđє Ͼђץlɠє
Լסяє ℳססɗү
עΨ Gatatem ђαвίв Ψיע

I could do more searching... Some of these things are even more common 
than shown here.  Using ღ for a heart ♡ is extremely widespread, and 
decorations like 亗 and Ϯ abound.  Note some decorations involving ღ with 
some Arabic(!) combining characters. Note the use of Hebrew and Arabic 
and CJK and other characters to represent Latin letters to which they 
bear only a passing resemblance.  There are also a lot of names in all 
small-caps or all full-width (I didn't include any examples of just that 
because they seemed so ordinary), or "inverted"  ·uoı̣ɥsɐɟ ꞁɐnsn əɥʇ uı̣


I don't know what, precisely, this argues for or against.  Would people 
deny that this is an "abuse" of the character-set, even though people 
are doing it and it works for them?  The medium is pretty indisputably 
plain-text.  Should all this kind of thing be somehow made to "work" for 
these creative, if mystifying, people? These are clearly pretty far-out 
examples (though not extreme, compared to what's out there, nor 
uncommon, from what I have been told.)


This discussion has been very interesting, really.  I've heard what I 
thought were very good points and relevant arguments from both/all 
sides, and I confess to not being sure which I actually prefer.  Just 
giving you more to think about...


~mark



Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Asmus Freytag wrote,

> ...What this teaches you is that italicizing (or boldfacing)
> text is fundamentally related to picking out parts of your
> text in a different font.

Typically from the same typeface, though.

> So those screen readers got it right, except that they could
> have used one of the more typical notational conventions that
> the math alphabetics are used to express (e.g. "vector" etc.),
> rather than rattling off the Unicode name.

WRT text-to-voice applications, such as "VoiceOver", I wonder how well 
they would do when encountering /any/ exotic text runs or characters.  
Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin 
text.  For example:  "The Han radical # 72, which looks like '日', means 
'sun'."  Would the application "say" the character as a Japanese reader 
would expect to hear it?  Or in one of the Chinese dialects?  Or would 
the application just give the hex code point?


In an era where most of the states in my country no longer teach cursive 
writing in public schools, it seems unlikely that Twitter users (and so 
forth) will be clamoring for the ability to implement Chicago Style text 
properly on their cell phone screens.  (Many users would probably prefer 
to use the cell phone to order a Chicago style pizza.)  But, stranger 
things have happened.




Re: A last missing link for interoperable representation

2019-01-12 Thread Marcel Schneider via Unicode

On 12/01/2019 00:17, James Kass via Unicode wrote:

> […]
>
> The fact that the math alphanumerics are incomplete may have been
> part of what prompted Marcel Schneider to start this thread.


No, really not at all. I didn’t even dream of having italics in Unicode
working out of the box. That would be exactly the sort of demand that
would have completely discredited my advocating the use of preformatted
superscripts for the Unicode-conformant and interoperable representation
of a handful of languages spoken by one third of mankind and using the
Latin script, while no other script is concerned with that orthographic
feature. (There is no clear borderline between orthography and typography
here, but with ordinal indicators in particular and abbreviation
indicators in general we’re clearly on the orthographic side. SC2/WG3
would agree, since they deemed "ª" and "º" worth encoding in 8-bit
charsets.)

It started when I found in the XKB keysymdef.h four dead keysyms added
for Karl Pentzlin’s German T3 layout, among which dead_lowline, and
remembered that at some point in history, users were deprived of the
means of typing a combining underscore. I didn’t think of the extra
letterspacing (called “gesperrt”, i.e. spaced out, in German) that
Mark E. Shoulson mentioned upthread, (a) because it isn’t used for that
purpose in the locale I’m working for, and (b) because emulating it with
interspersed NARROW NO-BREAK SPACEs would make the text unsearchable.
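The searchability objection is easy to demonstrate. A minimal Python sketch (the NNBSP emulation itself is the hypothetical being argued against, not a recommendation):

```python
# Illustrative only: emulate German "gesperrt" (spaced-out) emphasis by
# interspersing U+202F NARROW NO-BREAK SPACE between the letters.
word = "gesperrt"
spaced = "\u202F".join(word)

# Unlike variation selectors, NNBSP is not a default-ignorable code
# point, so a plain substring search no longer finds the word:
assert word not in spaced
```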



> If stringing encoded italic Latin letters into words is an abuse of
> Unicode, then stringing punctuation characters to simulate a "smiley"
> (☺) is an abuse of ASCII - because that's not what those punctuation
> characters are *for*.  If my brain parses such italic strings into
> recognizable words, then I guess my brain is non-compliant.


I think that, like Google Search having extensive equivalence classes
treating mathematical letters like plain ASCII, text-to-speech software
could use a little bit of AI to recognize strings of those letters as
ordinary words with emphasis, as James Kass suggested – all the more as
we’re actually able to add combining diacritics for correct spelling
in some diacriticized alphabets (including a few with non-decomposable
diacritics), though with somewhat less-than-optimal diacritic placement
in many cases in the current state of the art – and also to parse ASCII
art correspondingly, unlike what happened in another example shared on
Twitter downthread of the math-letters tweet:

https://twitter.com/ourelectra/status/1083367552430989315

Thanks,

Marcel


Re: A last missing link for interoperable representation

2019-01-12 Thread Asmus Freytag via Unicode

  
  
On 1/12/2019 5:22 AM, Richard Wordingham via Unicode wrote:

> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT)
> Julian Bradfield via Unicode wrote:
>
>> It's also fundamentally misguided. When I _italicize_ a word, I am
>> writing a word composed of (plain old) letters, and then styling the
>> word; I am not composing a new and different word ("_italicize_") that
>> is distinct from the old word ("italicize") by virtue of being made up
>> of different letters.
>
> And what happens when you capitalise a word for emphasis or to begin a
> sentence?  Is it no longer the same word?

Typographically, the act of using italics or a different font weight is
more akin to using a different font than to using different letters.
Not only did old metal type require the creation of a different font
(albeit with a design coordinated with the regular type), but even in
the digital world, purpose-designed italic etc. typefaces beat attempts
at parametrizing regular fonts.  (Although some of the intelligence
that goes into creating those designs can nowadays be approximated by
automation.)

What this teaches you is that italicizing (or boldfacing) text is
fundamentally related to picking out parts of your text in a different
font.  It's an operation on a span of text, not something that results
in different letters (or letter attributes).

Deep in the age of metal type this would have been no surprise to
users.  As I had occasion to mention before, some languages had the
(rather universally observed) typographical convention of setting
foreign terms apart by using a different font (Antiqua vs. Fraktur for
ordinary text).  At the same time, other languages used italics for the
same purpose (which technically also meant using a different typeface).

To go further, the use of typography to mark emphasis also followed
conventions that focused on spans of letters, not on individual
letters.  For example, in Fraktur you would never have been able to
emphasize a single letter, as emphasis was conveyed by increased
inter-letter spacing.  (That restriction was not as limiting as it
appears in languages that do not have single-letter words.)

Anyway, this points to a way to make the distinction between plain
text and rich text a more principled one (and explains why math
alphabets seemingly form an exception).

The domain of rich text is all typographic and stylistic elements that
establish spans of text, whether that is underlining, emphasis, letter
spacing, font weight, typeface selection or whatever.  Plain text
deals with letters in a way that is as stateless as possible, that is,
does not set up spans.  Math alphabetics are an exception by virtue of
the fact that they are individual letters that have a particular
identity different from the "same" letter in text or the "same" letter
that's part of a different math alphabet.

So those screen readers got it right, except that they could have used
one of the more typical notational conventions that the math
alphabetics are used to express (e.g. "vector" etc.), rather than
rattling off the Unicode name.

To reiterate, if you effectively require a span (even if you could
simulate it differently), you are in the realm of rich text.  The one
big exception to that is bidi, because it is utterly impossible to do
bidi text without text ranges.  Therefore, Unicode plain text
explicitly violates that principle in favor of achieving a fundamental
goal of universality, that is, being able to include the bidi
languages.

None of the other uses contemplated here rises to the same level of
violating a fundamental goal in the same way.

A./



Re: A last missing link for interoperable representation

2019-01-12 Thread wjgo_10...@btinternet.com via Unicode

James Kass wrote:

> For the V.S. option there should be a provision for consistency and
> open-endedness to keep it simple.  Start with VS14 and work backwards
> for italic, …


I have now made, tested and published a font, VS14 Maquette, that uses 
VS14 to indicate italic.


https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561

William Overington
Saturday 12 January 2019



-- Original Message --
From: "James Kass via Unicode" 
To: unicode@unicode.org
Sent: Friday, 2019 Jan 11 At 01:48
Subject: Re: A last missing link for interoperable representation


Richard Wordingham responded,


> ... simply using an existing variation
> selector character to do the job.


Actually, this might be a superior option.


For the V.S. option there should be a provision for consistency and 
open-endedness to keep it simple.  Start with VS14 and work backwards 
for italic, fraktur, antiqua...  (whatever the preferred order works out 
to be).  Or (better yet) start at VS17 and move forward (and change the 
rule that seventeen and up is only for CJK).


Is it true that many of the CJK variants now covered were previously 
considered by the Consortium to be merely stylistic variants?





Re: A last missing link for interoperable representation

2019-01-12 Thread Richard Wordingham via Unicode
On Sat, 12 Jan 2019 14:21:19 +
James Kass via Unicode  wrote:

> FWIW, the math formula:
> a + b ≠ 푏 + 푎
> ... becomes invalid if normalized NFKD/NFKC.  (Or if copy/pasted from
> an HTML page using marked-up ASCII into a plain-text editor.)

(a) Italic versus plain is not significant in the mathematics I've
encountered.  It's worse than distinguishing capital em and capital mu,
which is allowed if you're the head of department.

(b) a + b ≠ b + a is a general, but not universally true, statement for
ordinal numbers, the simplest example being

ω = 1 + ω ≠ ω + 1
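For readers not versed in ordinal arithmetic, the asymmetry above can be spelled out (a standard fact, stated here in LaTeX notation):

```latex
1 + \omega = \sup_{n < \omega}(1 + n) = \omega,
\qquad
\omega + 1 = \omega \cup \{\omega\} \neq \omega
```

so ordinal addition already fails to commute at the first infinite ordinal.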

(c) You're talking about a folding, not a normalisation.

The example you want would use emboldening, e.g.

"In general, 푎 + 퐛 ≠ 퐚 + 푏"

which is true for vectors 퐚 and 퐛 if one is treating the
quaternions as a direct sum of reals and real 3-vectors.

Richard.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Reading & writing & 'rithmatick...

This is a math formula:
a + b = b + a
... where the estimable "mathematician" used Latin letters from ASCII as 
though they were math alphanumeric variables.


This is an italicized word:
푘푎푘푖푠푡표푐푟푎푐푦
... where the "geek" hacker used Latin italics letters from the math 
alphanumeric range as though they were Latin italics letters.


Where's the harm?

FWIW, the math formula:
a + b ≠ 푏 + 푎
... becomes invalid if normalized NFKD/NFKC.  (Or if copy/pasted from an 
HTML page using marked-up ASCII into a plain-text editor.)


Yet the italicized word "kakistocracy" is still legible if normalized.  
If copy/pasted from an HTML page using the math alphanumeric characters, 
it survives intact.  If copy/pasted from markupped ASCII, it's still 
legible.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode




Julian Bradford wrote,


* Bradfield, sorry.



Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Julian Bradford wrote,

"It does not work with much existing technology. Interspersing extra
codepoints into what is otherwise plain text breaks all the existing
software that has not been, and never will be updated to deal with
arbitrarily complex algorithms required to do Unicode searching.
Somebody who needs to search exotic East Asian text will know that they
need software that understands VS, but a plain ordinary language user
is unlikely to have any idea that VS exist, or that their searches
will mysteriously fail if they use this snazzy new pseudo-plain-text
italicization technique"

Sounds like you didn't try it.  VS characters are default ignorable.

The first one is plain; the second one has a VS2 character after each 
letter, including the final "t":

apricot
a︁p︁r︁i︁c︁o︁t︁
Notepad finds them both if you type the word "apricot" into the search box.
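Whether a given search matches depends on the application honoring default-ignorable code points; a naive code-point substring search does not. A rough Python sketch of the difference (the stripping regex covers the standard variation-selector ranges):

```python
import re

# "apricot" with U+FE01 VARIATION SELECTOR-2 after each letter, as in
# the example above.
plain = "apricot"
tagged = "".join(ch + "\uFE01" for ch in plain)

# A naive substring search does not match...
assert plain not in tagged

# ...but an application that ignores default-ignorable variation
# selectors (approximated here by stripping U+FE00..U+FE0F and the
# supplementary VS17..VS256 range) finds it, as Notepad does.
strip_vs = re.compile(r"[\uFE00-\uFE0F\U000E0100-\U000E01EF]")
assert plain in strip_vs.sub("", tagged)
```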

"..."

Regardless of how you input italics in rich-text, you are putting italic 
forms into the display.


"I think the VS or combining format character approach *would* have
been a better way to deal with the mess of mathematical alphabets, ..."

I think so, too, but since I'm not a member of *that* user community, my 
opinion hasn't much value.  Plus VS characters were encoded after the 
math stuff.


"But for plain text, it's crazy."

Are you a member of the plain-text user community?



Re: A last missing link for interoperable representation

2019-01-12 Thread Julian Bradfield via Unicode
On 2019-01-11, James Kass via Unicode  wrote:
> Exactly.  William Overington has already posted a proof-of-concept here:
> https://forum.high-logic.com/viewtopic.php?f=10&t=7831
> ... using a P.U.A. character /in lieu/ of a combining formatting or VS 
> character.  The concept is straightforward and works properly with 
> existing technology.

It does not work with much existing technology. Interspersing extra
codepoints into what is otherwise plain text breaks all the existing
software that has not been, and never will be updated to deal with
arbitrarily complex algorithms required to do Unicode searching.
Somebody who needs to search exotic East Asian text will know that they
need software that understands VS, but a plain ordinary language user
is unlikely to have any idea that VS exist, or that their searches
will mysteriously fail if they use this snazzy new pseudo-plain-text
italicization technique.

It's also fundamentally misguided. When I _italicize_ a word, I am
writing a word composed of (plain old) letters, and then styling the
word; I am not composing a new and different word ("_italicize_") that
is distinct from the old word ("italicize") by virtue of being made up
of different letters.

I think the VS or combining format character approach *would* have
been a better way to deal with the mess of mathematical alphabets,
because for mathematicians, *b* is a distinct symbol from b, and while
there may be correlated use of alphabets, there need be no connection
whatever between something notated b and something notated *b*.

But for plain text, it's crazy.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.