Re: A last missing link for interoperable representation
Mark E. Shoulson wrote,

> This discussion has been very interesting, really. I've heard what I
> thought were very good points and relevant arguments from both/all
> sides, and I confess to not being sure which I actually prefer.

It's subjective, really. It depends on how one views plain-text and one's expectations for its future. Should plain-text be progressive, regressive, or stagnant? Because those are really the only choices. And opinions differ.

Most of us involved with Unicode probably expect plain-text to be around for quite a while. The figure bandied about in the past on this list is "a thousand years". Only a society of mindless drones would cling to the past for a millennium. So many of us probably figure that strictures laid down now will be overridden as a matter of course, over time.

Unicode will probably be around for a while, but the barrier between plain- and rich-text has already morphed significantly in the relatively short period of time it's been around. I became attracted to Unicode about twenty years ago, because Unicode opened up entire /realms/ of new vistas relating to what could be done with computer plain text. I hope this trend continues.
Re: A last missing link for interoperable representation
On 2019-01-12 4:26 PM, wjgo_10...@btinternet.com wrote:

> I have now made, tested and published a font, VS14 Maquette, that uses
> VS14 to indicate italic.
> https://forum.high-logic.com/viewtopic.php?f=10=7831=37561#p37561

The italics don't happen in Notepad, but VS14 Maquette works splendidly in LibreOffice! (Windows 7, in a *.TXT file.) Since the VS characters are supposed to be used with officially registered/recognized sequences, it's possible that Notepad isn't trying to implement the feature.

The official reception of the notion of using variant letter forms, such as italics, in plain-text is typically frosty. So advancement of plain-text might be left up to third-party developers, enthusiasts, and the actual text users. And there's nothing wrong with that. (It's non-conformant, though, unless the VS material is officially recognized/registered.)

Non-Latin scripts, such as Khmer, may have their own traditions and conventions WRT special letter forms, which is why starting at VS14 and working backwards might be inadequate in the long run. Khmer has a letter form called muul/moul/muol (not sure how to spell that one, but neither is anybody else), which superficially resembles fraktur for Khmer. Other non-Latin scripts may have a plethora of such forms/fonts/styles.
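For anyone wanting to experiment along these lines, the scheme as described can be sketched in a few lines of Python. The helper name `vs14_italicize` is made up for illustration; it simply appends VS14 (U+FE0D) after each base letter, which is what the VS14 Maquette font is described as keying on. (Whether a given renderer honors it is, of course, font- and application-dependent.)

```python
# Sketch of the scheme described above: append VS14 (U+FE0D) after each
# base letter so a cooperating font can substitute italic glyphs.
# The helper name "vs14_italicize" is hypothetical, not an existing tool.
VS14 = "\ufe0d"

def vs14_italicize(text: str) -> str:
    # Only letters get a selector; spaces and punctuation pass through.
    return "".join(ch + VS14 if ch.isalpha() else ch for ch in text)

tagged = vs14_italicize("works splendidly")
# In a non-cooperating renderer the selectors are invisible,
# so the string still reads "works splendidly" there.
```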
Re: A last missing link for interoperable representation
Just to add some more fuel for this fire, I note also the highly popular (in some places) technique of using Unicode letters that may have nothing whatsoever to do with the symbol or letter you mean to represent, apart from coincidental resemblance and looking "cool" enough. This happens a lot on Second Life, where you can set your "display name" distinct from your "user name", but the display name appears to be limited to Unicode *letters* and some punctuation, mostly, and certainly can't be outside the BMP. So for a sampling from stuff I've heard of...

ΑbiΑИØ SŦээlSØul ΛPΉӨD ΛИƓĿƐĪƇ Ɗє ℓα ℜudǝ ωђitmαη ΛЯℂӨƧ BΛПDΣЯΛƧ ღLɪɴᴅᴀღ ђÅℵℵƔ Fashionablez ℬãŋќş Ķhaгg єσηα MιяєƖуηη ℒυςノσυʂ ツ . 乙u 乙u 尺αмση ℓυιѕ αуαℓα mღn ᄊムレo Ɩ'M ŦЯØЦßĿЄ ƧЄƝƖȤЄƝ ƓƠƬƊƛMMƖƬ™ øקςøги вαℓℓѕ ßⱥţţïţuđє Ąşђεгöη ĄĶЯĨ Ğrєץ Đ尺ѦႺΘȠ đ σ ℜ ι ค ℵ :. ĦΔZΔRĐ ʕ·ᴥ·ʔ ϮJΩƧӇƲΔϮ ϯcH ℭℛℯȡĩȵŧă ⓁợⒼαℵ 亗 Amy 亗 ßяуⒶℵ GяєуωσLƒ тαקקαt Wuηđǝяレǝ کhäşhι ℰղcαηϯäɖσƦ ۣღۣۜ Jarah Sparksۣღۣۜ ઇઉ fleur ઇઉ ໓яαкє ςнυяςн ڰۣღ- Pandora Barbarosڰۣღ- ஐ tenayah ஐ-x- ღⒹムяк 丂σuℓ™ღ ץlđє Ͼђץlɠє Լסяє ℳססɗү עΨ Gatatem ђαвίв Ψיע

I could do more searching... Some of these things are even more common than shown here. Using ღ for a heart ♡ is extremely widespread, and decorations like 亗 and Ϯ abound. Note some decorations involving ღ with some Arabic(!) combining characters. Note the use of Hebrew and Arabic and CJK and other characters to represent Latin letters to which they bear only a passing resemblance.

There are also a lot of names in all small-caps or all full-width (I didn't include any examples of just that because they seemed so ordinary), or "inverted" ·uoı̣ɥsɐɟ ꞁɐnsn əɥʇ uı̣

I don't know what, precisely, this argues for or against. Would people deny that this is an "abuse" of the character-set, even though people are doing it and it works for them? The medium is pretty indisputably plain-text. Should all this kind of thing be somehow made to "work" for these creative, if mystifying, people?
These are clearly pretty far-out examples (though not extreme compared to what's out there, nor uncommon, from what I have been told).

This discussion has been very interesting, really. I've heard what I thought were very good points and relevant arguments from both/all sides, and I confess to not being sure which I actually prefer.

Just giving you more to think about...

~mark
Re: A last missing link for interoperable representation
Asmus Freytag wrote,

> ...What this teaches you is that italicizing (or boldfacing)
> text is fundamentally related to picking out parts of your
> text in a different font.

Typically from the same typeface, though.

> So those screen readers got it right, except that they could
> have used one of the more typical notational conventions that
> the mathalphabetics are used to express (e.g. "vector" etc.),
> rather than rattling off the Unicode name.

WRT text-to-voice applications, such as "VoiceOver", I wonder how well they would do when encountering /any/ exotic text runs or characters. Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin text. For example: "The Han radical #72, which looks like '日', means 'sun'." Would the application "say" the character as a Japanese reader would expect to hear it? Or in one of the Chinese dialects? Or would the application just give the hex code point?

In an era where most of the states in my country no longer teach cursive writing in public schools, it seems unlikely that Twitter users (and so forth) will be clamoring for the ability to implement Chicago Style text properly on their cell phone screens. (Many users would probably prefer to use the cell phone to order a Chicago-style pizza.) But stranger things have happened.
Re: A last missing link for interoperable representation
On 12/01/2019 00:17, James Kass via Unicode wrote:

> […] The fact that the math alphanumerics are incomplete may have been
> part of what prompted Marcel Schneider to start this thread.

No, really not at all. I didn’t even dream of having italics in Unicode working out of the box. That would be exactly the sort of demand that would have completely discredited my advocating the use of preformatted superscripts for the Unicode-conformant and interoperable representation of a handful of languages spoken by one third of mankind and using the Latin script, while no other script is concerned with that orthographic feature. (There is no clear borderline between orthography and typography here, but with ordinal indicators in particular and abbreviation indicators in general we’re clearly on the orthographic side. SC2/WG3 would agree, since they deemed "ª" and "º" worth encoding in 8-bit charsets.)

It started when I found in the XKB keysymdef.h four dead keysyms added for Karl Pentzlin’s German T3 layout, among which dead_lowline, and remembered that at some point in history, users were deprived of the means of typing a combining underscore. I didn’t think of the extra letterspacing (called “gesperrt”, spaced out, in German) that Mark E. Shoulson mentioned upthread, (a) because it isn’t used for that purpose in the locale I’m working for, and (b) because emulating it with interspersed NARROW NO-BREAK SPACEs would make the text unsearchable.

If stringing encoded italic Latin letters into words is an abuse of Unicode, then stringing punctuation characters together to simulate a "smiley" (☺) is an abuse of ASCII, because that's not what those punctuation characters are *for*. If my brain parses such italic strings into recognizable words, then I guess my brain is non-compliant.
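As a concrete illustration of the dead_lowline mechanism mentioned above: underlining can be emulated in plain text by following each letter with U+0332 COMBINING LOW LINE, the character such a dead key is meant to type. The `underline` helper below is a hypothetical sketch, not part of any existing tool:

```python
# Sketch: emulate underlining in plain text with U+0332 COMBINING LOW LINE,
# the character a dead_lowline key is meant to type.
# The helper name "underline" is hypothetical.
COMBINING_LOW_LINE = "\u0332"

def underline(text: str) -> str:
    # Attach the combining mark to every non-space character.
    return "".join(ch if ch.isspace() else ch + COMBINING_LOW_LINE
                   for ch in text)
```

Unlike interspersed NARROW NO-BREAK SPACEs, the base letters remain intact in sequence, so a search that ignores combining marks can still find the underlined word.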
I think that, like Google Search with its extensive equivalence classes treating mathematical letters like plain ASCII, text-to-speech software could use a little bit of AI to recognize strings of those letters as ordinary words with emphasis, as James Kass suggested, all the more as we’re actually able to add combining diacritics for correct spelling in some diacriticized alphabets (including a few with non-decomposable diacritics), though with somewhat less-than-optimal diacritic placement in many cases in the current state of the art. Such software could also parse ASCII art correspondingly, unlike what happened in another example shared on Twitter downthread of the math-letters tweet:

https://twitter.com/ourelectra/status/1083367552430989315

Thanks,

Marcel
Re: A last missing link for interoperable representation
On 1/12/2019 5:22 AM, Richard Wordingham via Unicode wrote:

> On Sat, 12 Jan 2019 10:57:26 + (GMT)
> Julian Bradfield via Unicode wrote:
>
>> It's also fundamentally misguided. When I _italicize_ a word, I am
>> writing a word composed of (plain old) letters, and then styling the
>> word; I am not composing a new and different word ("_italicize_") that
>> is distinct from the old word ("italicize") by virtue of being made up
>> of different letters.
>
> And what happens when you capitalise a word for emphasis or to begin a
> sentence? Is it no longer the same word?

Typographically, the act of using italics or a different font weight is more akin to using a different font than to using different letters. Not only did old metal type require the creation of a different font (albeit with a design coordinated with the regular type), but even in the digital world, purpose-designed italic etc. typefaces beat attempts at parametrizing regular fonts. (Although some of the intelligence that goes into creating those designs can nowadays be approximated by automation.)

What this teaches you is that italicizing (or boldfacing) text is fundamentally related to picking out parts of your text in a different font. It's an operation on a span of text, not something that results in different letters (or letter attributes).

Deep in the age of metal type this would have been no surprise to users. As I had occasion to mention before, some languages had the (rather universally observed) typographical convention of setting foreign terms apart by using a different font (Antiqua vs. Fraktur for ordinary text). At the same time, other languages used italics for the same purpose (which technically also meant using a different typeface). To go further, the use of typography to mark emphasis also followed conventions that focused on spans of letters, not on individual letters.
For example, in Fraktur you would never have been able to emphasize a single letter, as emphasis was conveyed by increased inter-letter spacing. (That restriction was not as limiting as it appears in languages that do not have single-letter words.)

Anyway, this points to a way to make the distinction between plain text and rich text a more principled one (and explains why math alphabets seemingly form an exception). The domain of rich text is all typographic and stylistic elements that establish spans of text, whether that is underlining, emphasis, letter spacing, font weight, typeface selection or whatever. Plain text deals with letters in a way that is as stateless as possible, that is, it does not set up spans. Math alphabetics are an exception by virtue of the fact that they are individual letters that have a particular identity different from the "same" letter in text or the "same" letter that's part of a different math alphabet.

So those screen readers got it right, except that they could have used one of the more typical notational conventions that the mathalphabetics are used to express (e.g. "vector" etc.), rather than rattling off the Unicode name.

To reiterate, if you effectively require a span (even if you could simulate it differently), you are in the realm of rich text. The one big exception to that is bidi, because it is utterly impossible to do bidi text without text ranges. Therefore, Unicode plain text explicitly violates that principle in favor of achieving a fundamental goal of universality, that is, being able to include the bidi languages. None of the other uses contemplated here rises to the same level of violating a fundamental goal in the same way.

A./
Re: A last missing link for interoperable representation
James Kass wrote:

> For the V.S. option there should be a provision for consistency and
> open-endedness to keep it simple. Start with VS14 and work backwards
> for italic, …

I have now made, tested and published a font, VS14 Maquette, that uses VS14 to indicate italic.

https://forum.high-logic.com/viewtopic.php?f=10=7831=37561#p37561

William Overington

Saturday 12 January 2019

-- Original Message --
From: "James Kass via Unicode"
To: unicode@unicode.org
Sent: Friday, 2019 Jan 11 At 01:48
Subject: Re: A last missing link for interoperable representation

Richard Wordingham responded,

> ... simply using an existing variation selector character to do the job.

Actually, this might be a superior option. For the V.S. option there should be a provision for consistency and open-endedness to keep it simple. Start with VS14 and work backwards for italic, fraktur, antiqua... (whatever the preferred order works out to be). Or (better yet) start at VS17 and move forward (and change the rule that seventeen and up is only for CJK). Is it true that many of the CJK variants now covered were previously considered by the Consortium to be merely stylistic variants?
Re: A last missing link for interoperable representation
On Sat, 12 Jan 2019 14:21:19 + James Kass via Unicode wrote:

> FWIW, the math formula:
> a + b ≠ 푏 + 푎
> ... becomes invalid if normalized NFKD/NFKC. (Or if copy/pasted from
> an HTML page using marked-up ASCII into a plain-text editor.)

(a) Italic versus plain is not significant in the mathematics I've encountered. It's worse than distinguishing capital em and capital mu, which is allowed if you're the head of department.

(b) a + b ≠ b + a is a general, but not universally true, statement for ordinal numbers, the simplest example being ω = 1 + ω ≠ ω + 1.

(c) You're talking about a folding, not a normalisation. The example you want would use emboldening, e.g. "In general, 푎 + 퐛 ≠ 퐚 + 푏", which is true for vectors 퐚 and 퐛 if one is treating the quaternions as a direct sum of reals and real 3-vectors.

Richard.
Re: A last missing link for interoperable representation
Reading & writing & 'rithmatick...

This is a math formula:

a + b = b + a

... where the estimable "mathematician" used Latin letters from ASCII as though they were math alphanumeric variables.

This is an italicized word:

푘푎푘푖푠푡표푐푟푎푐푦

... where the "geek" hacker used Latin italic letters from the math alphanumeric range as though they were Latin italic letters.

Where's the harm?

FWIW, the math formula:

a + b ≠ 푏 + 푎

... becomes invalid if normalized NFKD/NFKC. (Or if copy/pasted from an HTML page using marked-up ASCII into a plain-text editor.) Yet the italicized word "kakistocracy" is still legible if normalized. If copy/pasted from an HTML page using the math alphanumeric characters, it survives intact. If copy/pasted from marked-up ASCII, it's still legible.
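The normalization behaviour described above is easy to verify with Python's standard unicodedata module. A minimal sketch (the offset trick for building the word works here because "kakistocracy" contains no "h", whose math-italic slot U+1D455 is reserved in favour of U+210E):

```python
import unicodedata

# Math-italic letters carry compatibility decompositions to plain ASCII,
# so NFKC folds them back to ordinary letters.
formula = "a + b \u2260 \U0001d44f + \U0001d44e"   # a + b ≠ 푏 + 푎
word = "".join(chr(0x1D44E + ord(c) - ord("a")) for c in "kakistocracy")

folded_formula = unicodedata.normalize("NFKC", formula)
folded_word = unicodedata.normalize("NFKC", word)
# folded_formula is now "a + b ≠ b + a" -- a false statement,
# while folded_word is plain "kakistocracy" -- still legible.
```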
Re: A last missing link for interoperable representation
Julian Bradford wrote,

* Bradfield, sorry.
Re: A last missing link for interoperable representation
Julian Bradford wrote,

"It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be, updated to deal with the arbitrarily complex algorithms required to do Unicode searching. Somebody who needs to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique."

Sounds like you didn't try it. VS characters are default ignorable. The first word below is straight; the second one has VS2 characters interspersed and after the "t":

apricot
a︁p︁r︁i︁c︁o︁t︁

Notepad finds them both if you type the word "apricot" into the search box.

"..."

Regardless of how you input italics in rich-text, you are putting italic forms into the display.

"I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, ..."

I think so, too, but since I'm not a member of *that* user community, my opinion hasn't much value. Plus the VS characters were encoded after the math stuff.

"But for plain text, it's crazy."

Are you a member of the plain-text user community?
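For what it's worth, the search behaviour described above can be approximated in a few lines: strip the variation selectors before matching, and the tagged word is found. This is a rough sketch under the assumption that only the sixteen BMP selectors VS1–VS16 (U+FE00..U+FE0F) need handling; full default-ignorable handling covers more characters (e.g. the Plane-14 selectors U+E0100..U+E01EF):

```python
# Rough sketch of a VS-blind substring search. Real default-ignorable
# handling covers more characters; this strips only the sixteen BMP
# variation selectors, which is all the "apricot" example uses.
def strip_vs(s: str) -> str:
    return "".join(ch for ch in s if not ("\ufe00" <= ch <= "\ufe0f"))

def vs_blind_find(haystack: str, needle: str) -> bool:
    # Normalize both sides so a plain query matches VS-tagged text.
    return strip_vs(needle) in strip_vs(haystack)

plain = "apricot"
tagged = "a\ufe01p\ufe01r\ufe01i\ufe01c\ufe01o\ufe01t\ufe01"  # VS2 interspersed
```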
Re: A last missing link for interoperable representation
On 2019-01-11, James Kass via Unicode wrote:

> Exactly. William Overington has already posted a proof-of-concept here:
> https://forum.high-logic.com/viewtopic.php?f=10=7831
> ... using a P.U.A. character /in lieu/ of a combining formatting or VS
> character. The concept is straightforward and works properly with
> existing technology.

It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be, updated to deal with the arbitrarily complex algorithms required to do Unicode searching. Somebody who needs to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique.

It's also fundamentally misguided. When I _italicize_ a word, I am writing a word composed of (plain old) letters, and then styling the word; I am not composing a new and different word ("_italicize_") that is distinct from the old word ("italicize") by virtue of being made up of different letters.

I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, because for mathematicians, *b* is a distinct symbol from b, and while there may be correlated use of alphabets, there need be no connection whatever between something notated b and something notated *b*. But for plain text, it's crazy.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.