Re: abstract characters, semantics, meaningful transformations ... Was: Tibetan Paluta

2017-05-01 Thread David Starner via Unicode
On Mon, May 1, 2017 at 7:26 AM Naena Guru via Unicode 
wrote:

> This whole attempt to make digitizing Indic script some esoteric,
> 'abstract', 'semantic representation' and so on seems to me is an attempt
> to make Unicode the realm of the some super humans.
>
Unicode is like writing. At its core, it is a hairy esoteric mess; mix
certain chemicals in the right ways, prepare a writing implement and a
writing surface in the right (non-trivial) ways, and then manipulate that
implement carefully to make certain marks whose boundary between correct
and incorrect is unclear. But in the end, as much of that is removed
from the user's problem as possible; in the case of a modern
word-processing system, it's a matter of hitting the keys and then hitting
print, in complete ignorance of all the silicon and printing magic going on
in between.

Unicode is not the realm of everyone; it's the realm of people with a
certain amount of linguistic knowledge and computer knowledge. There's only
a problem if those people can't make it usable for the everyday programmer
and, through them, for the average person.

> The purpose of writing is to represent speech.
>
Meh. The purpose of writing is to represent language, which may be
unrelated to speech (like in the case of SignWriting and mathematics) or
somewhat related to speech--very few forms of writing are direct
transcriptions of speech. Even the closest tend to exchange a lot of
intonation details for punctuation that reveals different information.

> English writing was massacred when printing was brought in from Europe.
>
No, it wasn't. Printing made no difference to the fact that English has a
dozen vowels with five letters to write them. The thorn has little impact
on the ambiguity of English writing. The problem with printing is that it
fossilizes the written language, and our spellings have stayed the same
while the pronunciations have changed. And the dissociation of sound and
writing sometimes helps English; even when two English speakers from
different parts of the world would have trouble understanding each other,
writing is usually not so impaired.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread David Starner via Unicode
On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <
unicode@unicode.org> wrote:

> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the
> case for other situations


UTF-8 is clearly more efficient space-wise for any text that includes more
ASCII characters than characters between U+0800 and U+FFFF. Given the
prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew
and Arabic will pretty much always be smaller in UTF-8.

Even for scripts that go from 2 bytes to 3, webpages can get much smaller
in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a
factor of 1.8). The maximum change in the other direction is 1.5, as two
bytes go to three.
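
To make the arithmetic concrete, here's a minimal Python 3 sketch (my
own illustration, with made-up sample strings):

    # Greek letters cost 2 bytes in both encodings, but the ASCII spaces
    # and punctuation tip the balance toward UTF-8; CJK text tips it the
    # other way, by at most that factor of 1.5 (2 bytes -> 3).
    samples = [
        "Ελληνικό κείμενο: 2 bytes ανά γράμμα",  # Greek mixed with ASCII
        "中文网页的正文内容",                    # CJK, 3 bytes each in UTF-8
    ]
    for s in samples:
        print(len(s.encode("utf-8")), len(s.encode("utf-16-le")))

The first line comes out smaller in UTF-8; the second smaller in UTF-16,
but never by more than half again.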


> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.
>

Not necessarily; you can legally process Unicode text without worrying
about combining characters, whereas you cannot process UTF-16 without
handling surrogates.
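
The difference is easy to show in code; a minimal sketch (my example,
assuming well-formed input):

    # Iterating code points over UTF-16 code units forces you to pair
    # surrogates; skipping the analogous treatment of combining marks
    # still leaves you with legal code points.
    def utf16_units_to_codepoints(units):
        out, i = [], 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:  # high surrogate: must consume its pair
                low = units[i + 1]     # assumes well-formed UTF-16
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
            else:
                out.append(u)
                i += 1
        return out

    # U+1F600 is one code point but two UTF-16 code units.
    assert utf16_units_to_codepoints([0xD83D, 0xDE00]) == [0x1F600]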


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD.  There might seem to be *two* names that both contain
> U+FFFD in the same place.  How do you distinguish between them?
>

If the database holds raw bytes, then the name is a byte string, not a
Unicode string, and can't contain U+FFFD at all. It's a relatively easy
rule to make and enforce that a string in a database is a validly formatted
string; I would hope that most SQL servers do in fact reject malformed
UTF-8 strings. On the other hand, I'd expect that an SQL server would
accept U+FFFD in a Unicode string.
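
A minimal sketch of that distinction (my example; the name is made up):

    raw = b"Andr\xe9"           # Latin-1 bytes: ill-formed as UTF-8
    try:
        raw.decode("utf-8")     # a validating store rejects this outright
    except UnicodeDecodeError:
        print("rejected: not valid UTF-8")

    name = "Andr\ufffd"         # an already-decoded Unicode string
    print("\ufffd" in name)     # True: U+FFFD is a perfectly legal character

The byte string either validates or it doesn't; U+FFFD can only exist
after decoding.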


> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate.  That seems a
> perfectly sensible rule to adopt.
>

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
the only source of such UTF-8 data is willful attempts to break security,
and in that case, how is this a win? Non-attack sources of broken data are
much more likely to be the result of mixing UTF-8 with other character
encodings or raw binary data.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.
>

Which causes various other security problems; if an object (file, database
element, etc.) gets a name with a U+FFFD in it, it becomes impossible to
reference. That an IEEE 754 float may not equal itself is a perpetual
source of confusion for programmers.


> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.
>

In this case, it's pretty clear, but I don't see it as a general rule. Any
rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset
confusion, mojibake, or random binary data. 88 A0 8B D4 is UTF-16 Chinese,
but I'm not going to insist that it get replaced with U+FFFD U+FFFD because
it's clear (to me) that it was meant as two characters.
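
For what it's worth, Python's decoder follows the maximal-subpart
convention, so you can watch it make exactly this call (my example):

    # E0 can never be continued by 80, so E0 is replaced alone, then
    # each stray 80 is replaced on its own -- three U+FFFDs in all.
    print(b"\xe0\x80\x80".decode("utf-8", errors="replace"))
    # By comparison, E0 A0 80 is well-formed U+0800 and decodes cleanly.
    print(b"\xe0\xa0\x80".decode("utf-8", errors="replace"))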


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-04 Thread David Starner via Unicode
On Sun, Jun 4, 2017 at 9:13 PM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> Sorry to be late with this, but if 20.1 bits turn out to not be enough,
> what about 21 bits?
>
> That would still limit UTF-8 to four bytes, but would almost double the
> code space. Assuming (conservatively) that it will take about a century
> to fill up all 17 (well, actually 15, because two are private) planes,
> this would give us another century.
>
> Just one more crazy idea :-(.
>

It seems hard to estimate the value of that without knowing why we ran out
of characters. A slow accumulation of a huge number of Chinese ideographs
and new Native American scripts, maybe. Access to a library with a trillion
works over billions of years from millions of species, probably not. Given
that we're at no risk of running out of characters right now, speculating
on this seems pointless.


Fwd: Unicode education in Schools

2017-08-24 Thread David Starner via Unicode
---------- Forwarded message ---------
From: David Starner 
Date: Thu, Aug 24, 2017, 6:16 PM
Subject: Re: Unicode education in Schools
To: Richard Wordingham 




On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
>
> Richard.
>

Steer them away from reinventing the wheel. If they use Java, use Java
strings. If they're using GTK, use strings compatible with GTK. If they're
writing JavaScript, use JavaScript strings. There's basically no system
that lacks Unicode strings, or where they'd be better off reinventing the
wheel.



Linearized tilde?

2017-12-29 Thread David Starner via Unicode
https://en.wikipedia.org/wiki/African_reference_alphabet says "The 1982
revision of the alphabet was made by Michael Mann and David Dalby, who had
attended the Niamey conference. It has 60 letters; some are quite different
from the 1978 version." and offers the linearized tilde, a tilde squeezed
into the space and location of the normal lowercase 'x' or 'o'. (See
https://commons.wikimedia.org/wiki/File:Latin_letter_Linearized_tilde_(Mann-Dalby_form).svg
) The German WP article specifically says "Der Buchstabe ist in keine
aktuelle Orthografie übernommen und ist auch nicht in Unicode enthalten
(Stand 2013, Unicode Version 6.3)." That is, "The letter has not been
adopted into any current orthography and is not included in Unicode (as of
2013, Unicode version 6.3)." Should it be?


Re: 0027, 02BC, 2019, or a new character?

2018-01-24 Thread David Starner via Unicode
On Wed, Jan 24, 2018 at 6:31 PM Shriramana Sharma via Unicode <
unicode@unicode.org> wrote:

>
> On 23-Jan-2018 10:03, "James Kass via Unicode" 
> wrote:
>
> (bottle, east, skier, crucial, cherry)
> s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e
> sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe
> s̈ïs̈a, s̈yg̈ys, s̈an̈g̈ys̈y, s̈es̈üs̈i, s̈ïïe
> śíśa, śyǵys, śańǵyśy, śeśúśi, śííe
>
>
[...]


>
> I retract my earlier statement about digraphs probably being the best
> option. It was made without looking at the actual requirement. For such
> heavy usage, it would simply make things horrible.
>

I'd say that the words chosen for this discussion were specifically
chosen for their heavy usage. Wikipedia has a translation of "All human
beings are born free and equal in dignity and rights. They are endowed with
reason and conscience and should act towards one another in a spirit of
brotherhood.", in what I believe is the new apostrophe-laden orthography:

Barlyq adamdar tu'masynan azat ja'ne qadyr-qasi'eti men quqtary ten' bolyp
du'ni'ege keledi. Adamdarg'a aqyl-parasat, ar-ojdan berilgen, sondyqtan
olar bir-birimen tu'ystyq, bau'yrmaldyq qarym-qatynas jasau'lary ti'is.

It's not that bad, though apostrophes still aren't an orthographic win.
I'm voting for the Uniform Turkic Alphabet, for the grand total of zero my
vote is worth.


Re: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-30 Thread David Starner via Unicode
On Tue, Jan 30, 2018 at 2:23 AM Alastair Houghton via Unicode <
unicode@unicode.org> wrote:

> This pattern exists across the board at the two companies; the Windows API
> hasn’t changed all that much since Windows NT 4/95, whereas Apple has
> basically thrown away all the work it did up to Mac OS 9 and is a lot more
> aggressive about deprecating and removing functionality even in Mac OS
> X/macOS than Microsoft ever was.
>

I'm not really clear on all the Windows details, as a long-time Linux
programmer, but Mac OS X (2001) was 16 years ago and Windows 95 (1995) is
22, so there's not much difference even taking your numbers. The .NET
framework debuted in 2002, and the Universal Windows Platform debuted with
Windows 8 in 2012, so Microsoft has made some pretty large changes since NT
4. They do seem to be more focused on keeping backwards-compatibility
layers, but it's not that they haven't been "prepared to make radical
changes".


Re: Why so much emoji nonsense?

2018-02-14 Thread David Starner via Unicode
On Wed, Feb 14, 2018 at 12:55 AM Erik Pedersen via Unicode <
unicode@unicode.org> wrote:

> Dear Unicode Digest list members,
>
> Emoji, in my opinion, are almost entirely outside the scope of the Unicode
> project. Unlike text composed of the world’s traditional alphabetic,
> syllabic, abugida or CJK characters, emoji convey no utilitarian and
> unambiguous information content. Let us, therefore, abandon Emoji support
> in Unicode as a project that failed. If corporations want to maintain
> support for Emoji, let’s require them to use only the Private Use Area and,
> henceforth, confine Unicode expansion to attested characters from so far
> unsupported scripts.
>

Because ' has so much unambiguous information content. Or even just c.
(What's the phonetic value of that letter? Okay, I'll be "easy" on you;
what's the phonetic value of that letter in English? What about e?)

Also, who are the full members of Unicode?
http://www.unicode.org/consortium/members.html says Google, Apple, Huawei,
Facebook, Microsoft, etc. By show of hands, who wants a substantial part of
the user's data to become incompatible? I think they just voted this down.

Even ignoring that, that bridge has been crossed. Unicode will not tear
out anything, but if it could, people could probably survive Cuneiform or
Linear A falling by the wayside. A not insubstantial part of the Unicode
data in the world includes emoji, and removing it would break everything.
Like many radical-change standards before it, a new Unicode standard
without emoji would be dead in the water; someone else would create a
competing backwards-compatible character standard, everyone would forget
about Unicode® and start using The One CCS®. It's like demanding that C use
bounds checking on its arrays, or that "island" go back to being spelled
"iland" now that we recognize it's not related to "isle". Even if mistakes
were made, they were carved into stone, and going back is not an option.


Re: Why so much emoji nonsense?

2018-02-14 Thread David Starner via Unicode
On Wed, Feb 14, 2018 at 11:16 AM James Kass via Unicode 
wrote:

> That's one way of looking at it.  Another way would be that the emoji
> were definitely outside the scope of the Unicode project as encoding
> them violated Unicode's initial encoding principles.
>

They were characters being interchanged as text in current use. They are
more inside the scope than many of the line-drawing characters for 8-bit
computers that have been there since day one, and analogous to many of the
dingbats that have also been there since day one.


Re: Why so much emoji nonsense?

2018-02-14 Thread David Starner via Unicode
On Wed, Feb 14, 2018 at 2:35 PM James Kass via Unicode 
wrote:

> David Starner wrote,
>
> > They were characters being interchanged as text
> > in current use.
>
> They were in-line graphics being interchanged as though they were
> text.  And they still are.  And we still disagree.
>

They were units of things being interchanged in formats of MIME types
starting with text/. From the beginning, Unicode has supported all the
cruft that's being interchanged in formats of MIME types starting with
text/.


Re: metric for block coverage

2018-02-17 Thread David Starner via Unicode
On Sat, Feb 17, 2018 at 3:30 PM Adam Borowski via Unicode <
unicode@unicode.org> wrote:

> þ or ą count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS
> AND TAIL WITH SMALL LETTER X WITH CARON.

þ is in Latin-1, and ą is in Latin Extended-A; the first is essential, even
in its marginal characters, and the second is pretty consistently useful in
the modern world. I don't see the problem or solution here; if something
supports a good chunk of the Arabic block, then it supports Arabic, and if
you need Persian and it supports Urdu instead, or vice versa, that's no
comfort.

> Too bad, that wouldn't work for symbols, or for dead scripts: a good runic
> font will have a complete coverage of elder futhark, anglo-saxon, younger
> and medieval, while only a completionist would care about franks casket or
> Tolkien's inventions.
>

Whereas I might guess that the serious users of Tolkien's runic rival or
outnumber the users of the script for other purposes; after all,
Anglo-Saxon and the other languages that appeared in Runic all have
standard Latin orthographies that are more suitable for scholarly purposes.


Re: metric for block coverage

2018-02-18 Thread David Starner via Unicode
On Sun, Feb 18, 2018 at 3:42 AM Adam Borowski  wrote:

> I probably used a bad example: scripts like Cyrillic (not even Supplement)
> include both essential letters and those which are historic only or used by
> old folks in a language spoken by 1000, who use Russian (or English...) for
> all computer use anyway -- all within one block.
>
> What I'm thinking, is that a beautiful font that covers Russian, Ukrainian,
> Serbian, Kazakh, Mongolian cyr, etc., should be recommended to users before
> one whose only grace is including every single codepoint.
>

I'm not sure what your goal is. Opening up gucharmap shows me that
FreeSerif and Noto Serif both have complete coverage of Cyrillic and
Cyrillic Supplement. We have reasonable fonts to offer users that cover
everything Cyrillic, or pretty much any script in use. I'm not sure where
and how you're trying to draw a line between a beautiful multilingual font
and a workable full font.

Ultimately, when I look at fonts, I look for Esperanto support. I'd be a
little surprised if a font had that but lacked Polish support, but that's
unlikely to be my problem. A useful feature for a font selector, for me,
would be the ability to select English, German, and Esperanto and get just
the fonts that support those languages (in an extended sense, including the
extra-ASCII punctuation and accents English needs, for example). It does me
absolutely no good to know that a font has "good, but not complete" Latin
Extended-A support. Likewise, if you're a Persian speaker, knowing that the
Arabic block has "good, but not complete" support is worthless.

For single-language ancient scripts, like Ancient Greek, virtually any
font with decent coverage should cover the generally useful stuff. For more
complex ancient scripts, it pretty much has to be handled per language. For
some ancient scripts, like Runic and Old Italic, I understand that after
the unification of the various writing traditions, most people feel a
language-specific font is necessary for any serious work.

The ultimate problem is that the question is: will it support my needs?
Language can often be used as a proxy, but names can often foil that. And
symbols are worse; in many, many instances € is the only character from
Currency Symbols used in an extended work, but the same is true of ₪.
Percentage of block support is minimally helpful. Miscellaneous Symbols
lives up to its name; ⛤, ⚇, ♷, ♕, and ☵ are all useful symbols, but not
likely to be found in the same work. Again, recommend 100% coverage, or do
the manual work of separating them into groups and offering a specific font
(game, occult, etc.) that covers each group; but fussing over a beautiful
font with less than 100% coverage versus a decent font with 100% coverage
seems counterproductive.

> Not sure if I understand your advice right: you're recommending to ignore
> all the complexity and going with just raw count of in-block coverage?
> This could work: a released font probably has codepoints its author
> considers important.
>
>

I guess separating out by language when you need to is going to be the way
that helps people the most. Where that's most complex, I'm not sure why
you're not just offering a decent 100% coverage font (which Debian has a
decent selection of) and stepping back.


Re: 0027, 02BC, 2019, or a new character?

2018-02-21 Thread David Starner via Unicode
On Wed, Feb 21, 2018 at 9:40 AM John W Kennedy via Unicode <
unicode@unicode.org> wrote:

> “Curmudgeonly” is a perfectly good English word attested back to 1590.
>

Curmudgeony may be identified as misspelled by Google, but it's got a bit
of usage dating back a hundred years. Wiktionary's entry at [[-y]] says
"This suffix is still very productive and can be added to almost any
word.", and that matches my feeling that this is a perfectly good word, a
perfectly wordy word, even if it wouldn't be used in formal English.


Re: Suggestions?

2018-02-21 Thread David Starner via Unicode
On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode <
unicode@unicode.org> wrote:

> Where can I post suggestions and feedback for Unicode?
>

Here is as good a place as any. There are specific places for a few
specific things, but if you do have something that's likely to get changed,
you'll probably need the help of someone here to get it through the
process. This is a quarter-century-old technical standard embedded in most
electronics, so I would temper any expectations of major changes; it works
the way it works because that's the way previous versions worked, and
nobody is interested in the trouble that changing that would involve.


Re: 0027, 02BC, 2019, or a new character?

2018-01-23 Thread David Starner via Unicode
On Tue, Jan 23, 2018 at 10:55 AM Doug Ewell via Unicode 
wrote:

> I think it's so cute that some of us think we can advise Nazarbayev on
> whether to use straight or curly apostrophes or accents or x's or
> whatever. Like he would listen to a bunch of Western technocrats.
>

Kazakh has a perfectly serviceable alphabet right now, and they probably
have a bunch of keyboards that work for it. And I'm sure there's some
Turkish firm that would be happy to deliver Turkish keyboards in bulk at
quite reasonable prices. There are reasons why they're changing to an ASCII
Latin script, and they're connected to the reasons he might listen to
Western technocrats.


Re: Encoding italic (was: A last missing link)

2019-01-16 Thread David Starner via Unicode
On Wed, Jan 16, 2019 at 7:41 PM James Kass via Unicode 
wrote:

> Computer text tradition aside, nobody seems to offer any legitimate
> reason why such information isn't worthy of being preservable in
> plain-text.  Perhaps there isn't one.
>
>

Worthy of being preservable? Again, if you want rich text, you know where
to find it. Maybe italics could have been encoded in plain text, even as
late as 1991. But more than a quarter century on, everything supports
italics with a few rare exceptions. You're changing everything at a very
low level for a handful of systems.

On the other hand, tradition matters. Again, at the bottom of this email
I'm drafting is "*B* *I* *U* | *A*▼ tT▼|▼"; that is, bold, italics,
underline, text color, text size, and extra options, like font choice and
lists. Even non-computer geeks are familiar with that distinction. What's
the advantage of moving one feature into Unicode and breaking the symmetry?

Meanwhile, most people won't enter anything into a tweet that they can't
enter from their keyboard, and if they had to, they would resort to cut and
paste. The only people Unicode italics could help, absent other changes,
are the people who can already use mathematical italics. If you don't have
buy-in from systems makers, people will continue to lack practical access
to italics in plain-text systems.


Re: Encoding italic (was: A last missing link)

2019-01-20 Thread David Starner via Unicode
On Sun, Jan 20, 2019 at 2:57 PM James Kass via Unicode 
wrote:

> At which time it would only become a moot point for Twitter users.
> There's also Facebook and other on-line groups.  Plus scholars and
> linguists.  And interoperability.
>

How do you envision this working? In practice, English is still often
limited to ASCII, because smart quotes and dashes aren't on the top-level
of the keyboard, nor are accented characters. Adding italics to Unicode
isn't going to change much if input tools don't support it, and keyboards
aren't likely to change. Twitter and Facebook aren't going to change much
if the apps and webpages don't provide a tool to mark italics.

I don't see scholars and linguists demanding this. Scholars use markup
languages that can annotate the details they need annotated, far more than
just italics. Various dialects of SGML, XML and TeX do the job, not plain
text.

You've yet to demonstrate that interoperability is an actual problem.
Modern operating systems have ways of copying rich text including italics
around.  Maybe it would have been better to have standardized rich text,
either in Unicode or in a standard layer above Unicode, back in 1991. But
that train has left; you're just going to complicate systems that currently
handle and exchange rich text including italics.

To expand on what Mark E. Shoulson said, to add new italics characters,
you're going to need to not only copy all of Latin, but also Cyrillic (and
reopen the whole Macedonian italics argument, where б, г, д, п, and т are
all different in italics from in Russian). But also, Chinese is sometimes
put in italics (cf.
http://multilingualtypesetting.co.uk/blog/chinese-italics-oblique-fonts/ )
even if that horrifies many people. That page argues for, among other
solutions, using what's effectively bold instead of italics. So we're
talking about reencoding all of Chinese at least once (for emphasis) or
twice (for italics and bold). That's a clear no-go.


Re: A last missing link for interoperable representation

2019-01-14 Thread David Starner via Unicode
On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode  wrote:
> The arguments against italics seem to be:
>
> ·Unicode is plain text. Italics is rich text.
>
> ·We haven't had it until now, so we don't need it.
>
> ·There are many rich text solutions, such as html.
>
> ·There are ways to indicate or simulate italics in plain text 
> including using underscore or other characters, using characters that look 
> italic (eg math), etc.
>
> ·Adding Italicization might break existing software
>
> ·The examples of  existing Unicode characters that seem to represent 
> rich text (emoji, interlinear annotation, et al) have justifications.

There generally shouldn't be multiple ways of doing things. For
example, if you think that searching for certain text in italics is
important, then having both HTML italics and Unicode italics are going
to cause searches to fail or succeed unexpectedly, unless the
underlying software unifies the two systems (an extra complexity).
Searching for certain italicized text could be done today in rich text
applications, were there actual demand for it.
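
The unification problem is easy to demonstrate (my example, using the
mathematical italics people already abuse for this purpose):

    import unicodedata

    html_form = "a very <i>important</i> point"
    math_form = ("a very \U0001d456\U0001d45a\U0001d45d\U0001d45c"
                 "\U0001d45f\U0001d461\U0001d44e\U0001d45b\U0001d461 point")

    # NFKC folds the mathematical italic letters back to ASCII...
    print(unicodedata.normalize("NFKC", math_form))  # "a very important point"
    # ...but nothing at the character level folds the HTML form, so a
    # search already has to unify two mechanisms; a third wouldn't help.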

> ·Plain text still has tremendous utility and rich text is not always 
> an option.

Where? Twitter has the option of doing rich text, as does any closed
system. In fact, Twitter is rich text, in that it hyperlinks web
addresses. That Twitter has chosen not to support italics is a choice.
If users don't like this, they can go to another system, or use
third-party tools to transmit rich text over Twitter. The use of
underscores or similar markings for italics would be mostly
compatible with human twitterers using the normal interface.

Source code is an example of plain text, and yet adding italics into
comments would require but a trivial change to editors. If the user
audience cared, it would have been done. In fact, I suspect there
exist editors and environments where an HTML subset is put into
comments and rendered by the editors; certainly active links would be
more useful in source code comments than italics.

Lastly, the places where I still find massive use of plain text are
the places this would hurt the most. GNU Grep's manpage shows no sign
that it supports searching under any form of Unicode normalization.
Same with GNU Less. Adding italics would just make searching plain
text documents more complex for their users. The domain name system
would just add them to the ban list, and they'd be used for spoofing
in filenames and other less controlled but still sensitive
environments.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Encoding italic (was: A last missing link)

2019-01-15 Thread David Starner via Unicode
On Tue, Jan 15, 2019 at 1:47 PM James Kass via Unicode 
wrote:

>
> Although there probably isn't really any concerted effort to "keep
> plain-text mediocre", it can sometimes seem that way.
>

Dennis Ritchie allegedly replied to requests for new features in C with “If
you want PL/I, you know where to find it.” C is still an austere language,
and still well used, with users who want C++ or Java knowing where to find
them. If you want all the features of rich text, use rich text.

> Avant-garde enthusiasts are on the leading edge by definition. That's
> why they're known as trend setters.  Unicode exists because
> forward-looking people envisioned it and worked to make it happen.
> Regardless of one's perception of exuberance, Unicode turned out to be
> so much more than a fringe benefit.
>
>

Unicode exists because large corporations wanted to sell computers to users
around the world, and found supporting a million different character sets
was costly and buggy, and that users wanted to mix scripts in ways that a
single character set didn't support and ISO 2022 and similar solutions just
weren't cutting it.

That's a clear user story. People can use italics on computers without
problem. Twitter has chosen not to support italics on their platform, which
users have found hacky work-arounds for. That's not such a clear user
story; shouldn't Twitter add support for italics instead of changing every
system in the world?


Re: Encoding italic

2019-01-15 Thread David Starner via Unicode
On Tue, Jan 15, 2019 at 5:17 PM James Kass via Unicode 
wrote:

> Enabling plain-text doesn't make rich-text poor.
>

Adding italics to Unicode will complicate the implementation of all rich
text applications that currently support italics.


> People who regard plain-text with derision, disdain, or contempt have
> every right to hold and share opinions about what plain-text is *for*
> and in which direction it should be heading.  Such opinions should
> receive all the consideration they deserve.
>

Really? No one here regards plain text with derision, disdain, or
contempt. I might complain about the people who claim to like plain text
yet would only be happy with massive changes to it, though.

However, plain text can be used standalone, and it can be used inside
programs and other formats. Dismissing the people who use Unicode in ways
that aren't plain text is unfair and hurts your case.


Re: A last missing link for interoperable representation

2019-01-14 Thread David Starner via Unicode
On Mon, Jan 14, 2019 at 5:58 PM Mark E. Shoulson via Unicode
 wrote:
> *If* the VS is ignored by searches, as apparently it should be and some
> have reported that it is, then VS-type solutions would NOT be a problem
> when it comes to searches

Who is using VS-type solutions? I could not enter one except by manually
using some sort of \u notation. Languages that need special input
support can easily adapt to unusual rules, but English Unicode is
weirdly hard to enter, because the QWERTY keyboard is ubiquitous and
standard. Smart quotes, non-HYPHEN-MINUS hyphens and dashes, and
accents generally require memorizing obscure entry methods or resorting
to a character list. Without great support from vendors, a new Unicode
italic system is only going to be used by the same people who currently
use mathematical italics.

> (and don't go whining about legacy software.
> If Unicode had to be backward-compatible with everything we wouldn't
> have gone beyond ASCII).

Then where's this plain text that absolutely needs italics? Those
legacy software systems are the place where unadorned plain text still
lives. Anything on the Web is inherently dealing with rich text.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Encoding italic

2019-01-21 Thread David Starner via Unicode
On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode
 wrote:
>  Even though /we/ know how to do
> it and have software installed to help us do it.

You're emailing from Gmail, which has support for italics in email.
The world has, in general, solved this problem.

>  > How do you envision this working?
>
> Splendidly!  (smile)  Social platforms, plain-text editors, and other
> applications do enhance their interfaces based on user demand from time
> to time.  User demand, at least on Twitter, seems established.

Then it would take six months, tops, for Twitter to produce and
release a rich-text interface. Far less time than waiting for Unicode
to get around to it.

> When corporate
> interests aren't interested, third-party developers develop tools.

Where are these tools? As I said, third-party developers could develop
tools to convert _underscore_ or /slash/ style italics to real
italics and back without waiting on Twitter or Unicode.

> Copy/pasting from a web page into a plain-text editor removes any
> italics content, which is currently expected behavior.  Opinions differ
> as to whether that represents mere format removal or a loss of meaning.
> Those who consider it as a loss of meaning would perceive a problem with
> interoperability.

Copy/pasting from a web page into a plain-text editor removes any
pictures and destructures tables, which definitely loses meaning.

It also removes strike-out markup, which can have an even more
dramatic effect on meaning than removing italics. As you pointed out
below, it removes superscripts and subscripts; unless you wish to
press for automatic conversion of those to Unicode, that's going to
continue happening. It drops bold and font changes, and any number of
other things that can carry meaning.

> Copy/pasting an example from the page into plain-text results in “ma1,
> ma2, ma3, ma4”, although the web page displays the letters as italic and
> the digits as (italic) superscripts.  IMO, that’s simply wrong with
> respect to the superscript digits and suboptimal with respect to the
> italic letters.

The superscripts show a problem with multiple encoding; even if you
think they should be Unicode superscripts, and they look like Unicode
superscripts, they might be HTML superscripts. Same thing would happen
with italics if they were encoded in Unicode.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Encoding italic

2019-01-16 Thread David Starner via Unicode
On Tue, Jan 15, 2019 at 10:19 PM James Kass via Unicode
 wrote:
> Would there be any advantages to rich-text apps if italics were added to
> Unicode?  Is there any cost/benefit data?  You've made an assertion
> about complication to rich-text apps which I can neither confirm nor refute.

It's trivial; virtually all rich-text apps support italics or
specifically don't support italics. Suddenly they have to unify
italics from the plain text with the higher level italics, or they
have to exclude italics from the input data.

> One possible advantage would be interoperability.  People snagging
> snippets of text from web pages or word processors and dropping data
> into their plain-text windows wouldn't be bamboozled by the unexpected.
> If computer text is getting exchanged, isn't it better when it can be
> done in a standard fashion?

Bamboozled by the unexpected? I think the expectation of those who
have plain-text windows (who are still watching silents, in a sense)
is that pasting data into them will not copy italics. As for more
common users, a quick web search shows many examples of people
frustrated that they cut and pasted something and details like bold and
italics were carried along. This also establishes that current systems
already allow rich text to be cut and pasted in a platform-specific
manner.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Encoding italic

2019-01-23 Thread David Starner via Unicode
On Tue, Jan 22, 2019 at 4:18 PM Richard Wordingham via Unicode
 wrote:
> On Mon, 21 Jan 2019 00:29:42 -0800
> David Starner via Unicode  wrote:
>
> > The superscripts show a problem with multiple encoding; even if you
> > think they should be Unicode superscripts, and they look like Unicode
> > superscripts, they might be HTML superscripts. Same thing would happen
> > with italics if they were encoded in Unicode.
>
> But if one strips the mark-up out, and searching is then based on
> the collation elements of the text, then this is not a problem.
> Mathematical and ASCII capitals differ only at the identity level.

Searching is not the only problem. Copying the data will reveal the
same problem.

Not only that, there was a previous argument that searching with
Unicode italics would let you find titles of books and such separately
from other usages of the phrase. That's not going to work if searches
are based on the collation elements and ignore the italics. Which also
brings up the question: if this is so important, why can't we search
for italicized data in web pages right now? For anyone interacting
with a web browser that folds searching, this will change nothing,
until if and when italics-sensitive searching is made available by the
web browser, which does not depend on Unicode supporting italics.

There are programs that extract titles from text files; I suspect the
programmers are most happy working with text formats that mark up
titles as titles, not italics. In systems that just mark up italics,
translating whatever form of italics marking is used is much easier
than separating italicized titles from other forms of italics.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: A last missing link for interoperable representation

2019-01-08 Thread David Starner via Unicode
On Tue, Jan 8, 2019 at 2:03 AM James Kass via Unicode 
wrote:

> The boundaries of plain text have advanced since the concept originated
> and will probably continue to do so.  Stress can currently be
> represented in plain text with conventions used in lieu of existing
> typographic practice.  Unicode can preserve texts created using the
> plain text kludges/conventions for marking stress, but cannot preserve
> printed texts which use standard publishing conventions for marking
> stress, such as italics.
>

Is there any way to preserve The Art of Computer Programming except as a
PDF or its TeX sources? Grabbing a different book near me, I don't see any
way to preserve it except as a full-color paged reproduction. Looking at
one data format, it uses bold, italics, and inversion (white on black), in
sans-serif, serif and script fonts; certainly in lines like "Treasure
standard (+1 starknife)", where the typography distinguishes the parts,
offering the bare string "Treasure standard (+1 starknife)" is completely
insufficient.

Can some books be mostly handled with Unicode plain text and italics? Sure.
HTML can handle them quite nicely. I'd say even those will have headers
that are typographically distinguished and should optimally be marked in a
transcription.

Re: A last missing link for interoperable representation

2019-01-11 Thread David Starner via Unicode
Emoji were being encoded as characters, as code points in private use
areas. That inherently called for a Unicode response. Bidirectional
support is a headache; the amount of confusion and outright exploits
from it is way higher than we'd like. The HTML support probably doesn't
help that. However, properly mixing Hebrew and English (e.g.) is
pretty clearly a plain text problem.

There are terabytes of Latin text out there, most of it encoded in
formats that already support italics. Whereas emoji, encoded as
characters in a then-limited number of systems, could be subsumed into
Unicode easily, much of that text will never be edited, and those
formats will never declare the existing means of marking italics out
of bounds, leaving multiple ways to do italics in perpetuity.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: A last missing link for interoperable representation

2019-01-13 Thread David Starner via Unicode
On Sat, Jan 12, 2019 at 8:26 PM James Kass via Unicode
 wrote:
> It's subjective, really.  It depends on how one views plain-text and
> one's expectations for its future.  Should plain-text be progressive,
> regressive, or stagnant?  Because those are really the only choices.
> And opinions differ.
>
> Most of us involved with Unicode probably expect plain-text to be around
> for quite a while.  The figure bandied about in the past on this list is
> "a thousand years".  Only a society of mindless drones would cling to
> the past for a millennium.  So, many of us probably figure that
> strictures laid down now will be overridden as a matter of course, over
> time.

And yet you write this in the Latin script, which has been around for a
couple of millennia. Arabic, Han ideographs, Cyrillic and Devanagari have
all been around a millennium.

Looking back at the history of computing, a large chunk of the
underlying technology has hit stability. ARM chips, x86 chips, Unix,
and Windows have all been around since 1985 or before: roughly 35
years ago, and 35 years after the first programmed computer. They
aren't wildly changing. Unicode is moving towards that position; it
does a job and doesn't need disruptive changes to continue to be
relevant.

> Unicode will probably be around for awhile, but the barrier between
> plain- and rich-text has already morphed significantly in the relatively
> short period of time it's been around.

Fixed pictures have been part of character sets for decades and were
part of Unicode 1.1. U+2704, WHITE SCISSORS, for example. And emoji
aren't disruptive in the way that moving something that's been part
of the rich-text layer forever into the plain-text layer would be.

> I became attracted to Unicode about twenty years ago.  Because Unicode
> opened up entire /realms/ of new vistas relating to what could be done
> with computer plain text.  I hope this trend continues.

The right tool for the job. If you need rich text, you should use rich
text. Emoji had to make the case that they were being used as
characters and there were no competing tools to handle them.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: A last missing link for interoperable representation

2019-01-13 Thread David Starner via Unicode
On Sun, Jan 13, 2019 at 7:03 PM Martin J. Dürst via Unicode
 wrote:
> No, the casing idea isn't actually a dumb one. As Asmus has shown, one
> of the best ways to understand what Unicode does with respect to text
> variants is that style works on spans of characters (words,...), and is
> rich text, but things that work on single characters are handled in
> plain text. Upper-case is definitely for the most part a single-character
> phenomenon (the recent Georgian MTAVRULI additions being the exception).

I would disagree; upper case is normally used in all caps or title
case, and the latter applies to a word, not a character.

I don't argue that Unicode is wrong for handling casing the way it
does, but it does massively complicate the processing of any Latin
text; virtually all searches should be case-insensitive, for example.
At least in English, computerized casing will always be problematic.
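
Case already shows what that processing tax looks like (my example):

    needle, haystack = "straße", "STRASSE, BERLIN"
    print(needle in haystack)                        # False: naive search fails
    print(needle.casefold() in haystack.casefold())  # True: fold first, always

Italics-sensitive text would add one more axis every search has to fold
away.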

> UPPER CASE can be used on whole spans of text, but that's not the main
> use case. And if UPPER CASE is used for emphasis, one way to do it (and
> the best way if this is actually a styling issue) is to use rich text
> and mark it up according to semantics, and then use some styling
> directive (e.g. CSS text-transform: uppercase) to get the desired look.

That's an example of how having multiple systems makes things more
complex and less consistent. If something can be written as all upper
case with the caps lock key, it will be. If a generated HTML file can
have uppercase added with a Python or SQL function, it probably will
be. Using CSS text-transform may be best practice, but simpler plain
text solutions will be used in a lot of cases and nothing can be
extrapolated clearly from its use or lack of use.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: A last missing link for interoperable representation

2019-01-09 Thread David Starner via Unicode
On Tue, Jan 8, 2019 at 11:58 PM James Kass via Unicode 
wrote:

>
> David Starner wrote,
>
>  > Can some books be mostly handled with Unicode plain text
>  > and italics? Sure. HTML can handle them quite nicely. ...
>
> Yes, many books can be handled very well with HTML using simple
> mark-up.  If I were producing a computer file to reproduce an old
> fiction novel, that's how I'd do it.  Not because it's better or simpler
> than plain text, but because it can't really be done in plain text at
> this time.  But if a section of the text is copy/pasted from the screen
> into an editor, some of the original information may be lost.
>

Looking at the Encyclopedia Brown book at hand, you'd lose any marking that
"The Case of the Headless Ghost" is the chapter header. The picture of the
treasure chest may be gratuitous, but "he hung his sign outside the
garage:" is followed by an image of said sign reading "BROWN DETECTIVE
AGENCY...". If you copy/paste that without carrying the original image
along, some of the original information will be lost.

In the Gmail editor, I see buttons to make the text bold, italic, or
underlined, and to change the color, text size and font. English users tend
to see italics as part and parcel of the text formatting. One can argue
that's an accident of history, that italics is somehow different from bold
and underline and font and text-size changes, but when the standard
perception conveniently matches how Unicode encodes the script, there
doesn't seem to be much point in changing things, especially with terabytes
of text that encode italics separately from the plain-text matter.

Frequently, copy/pasting material does preserve non-plain-text features; if
I paste a title from Wikipedia into here, it will show up much larger than
the rest of the text. It's a pain, because I want the underlying text, not
how it was displayed in its original context.

Honestly, I could argue that case should not be encoded. It would simplify
so much processing of Latin script text, and most of the time
case-sensitive operations are just wrong. Case is clearly a headache that
has to be dealt with in plain text, but it certainly doesn't encourage me
to add another set of characters that are basically the same but not.


Re: Encoding italic

2019-02-09 Thread David Starner via Unicode
On Sat, Feb 9, 2019 at 3:59 AM Kent Karlsson via Unicode <
unicode@unicode.org> wrote:

>
> On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:
> > • Reverse on: ESC [7m
> > • Reverse off: ESC [27m
>
> "Reverse" = "switch background and foreground colours".
>
> This is an (odd) colour thing. If you want to go with (full!) colour
> (foreground and background), fine, but the "reverse" is oddball (and
> based on what really old terminals were limited to when it comes to
> colour).
>

Note that this is actually the only thing that stands out to me where
Unicode fails to support older character sets; in PETSCII (Commodore 64),
the high-bit characters were the reverse (in this sense) of the low-bit
characters.


Re: Encoding italic

2019-01-25 Thread David Starner via Unicode
On Thu, Jan 24, 2019 at 11:16 PM Tex via Unicode  wrote:
> Twitter was offered as an example, not the only example just one of the most 
> ubiquitous. Many messaging apps and other apps would benefit from italics. 
> The argument is not based on adding italics to twitter.

And again, color me skeptical. If italics are just added to Unicode
and not to the relevant app or interface, they will not see much use,
in the same way that most non-ASCII characters for proper English--the
quotes, the dashes, the accents--are often ignored because they're too
hard to enter. But if you're going to add italics, having it in
Unicode doesn't make it significantly easier, particularly when they
need to support systems that predate Unicode adding italics.

> The biggest burden would be to the apps that would benefit, to add 
> italicizing and editing capabilities.

If they would benefit or if they'd accept the burden, they'd have
already added italics, via HTML or Markdown or escape sequences or
whatever.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Encoding italic

2019-01-31 Thread David Starner via Unicode
On Thu, Jan 31, 2019 at 12:56 AM Tex  wrote:
>
> David,
>
> "italics has never been considered part of plain text and has always been 
> considered outside of plain text. "
>
> Time to change the definition if that is what is holding you back.

That's not a definition; that's a fact. Again, it's like the 8-bit
byte; there are systems with other sizes of byte, but you usually
shouldn't worry about them. Building a system that doesn't have 8-bit
bytes is possible, but it's likely to cost more than it's worth.

> As has been said before, interlinear annotation, emoji and other features of 
> Unicode which  are now considered plain text were not in the original 
> definition.

https://www.w3.org/TR/unicode-xml/#Interlinear (which used to be
Unicode Technical Report #20) says "The interlinear annotation
characters were included in Unicode only in order to reserve code
points for very frequent application-internal use. ... Including
interlinear annotation characters in marked-up text does not work
because the additional formatting information (how to position the
annotation,...) is not available. ... The interlinear annotation
characters are also problematic when used in plain text, and are not
intended for that purpose."

Emoji, as has been pointed out several times, were in the original
Unicode standard and date back to the 1980s; the first DOS code page
has smileys at 0x01 and 0x02.

> If Unicode encoded an italic mechanism it would be part of plain text, just 
> as the many other styled spaces, dashes and other characters have become 
> plain text despite being typographic.

If Unicode encoded an italic mechanism, then some "plain text" would
include italics. Maybe it would be successful, and maybe it would join
the interlinear annotation characters as another discouraged poorly
supported feature.

> As with the many problems with walls not being effective, you choose to 
> ignore the legitimate issues pointed out on the list with the lack of italic 
> standardization for Chinese braille, text to voice readers, etc.

Text to voice readers don't have problems with the lack of italic
standardization; they have problems with people using mathematical
characters instead of actual letters.

> The choice of plain text isn't always voluntary.

The choice of using single-byte character sets isn't always voluntary.
That's why we should use ISO-2022, not Unicode. Or we can expect
people to fix their systems. What systems are we talking about, that
support Unicode but compel you to use plain text? The use of Twitter
is surely voluntary.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Encoding italic

2019-01-31 Thread David Starner via Unicode
On Wed, Jan 30, 2019 at 11:37 PM James Kass via Unicode
 wrote:
> As Tex Texin observed, differences of opinion as to where we draw the
> line between text and mark-up are somewhat ideological.  If a compelling
> case for handling italics at the plain-text level can be made, then the
> fact that italics can already be handled elsewhere doesn’t matter.  If a
> compelling case cannot be made, there are always alternatives.

To the extent I have an ideology here, it's that that line is arbitrary
and needs to fit practical demands. Should we have eight-bit bytes?
I'm not sure that was the best solution, and other systems worked just
fine, but we've got a computing environment that makes anything else
impractical. Unlike that question, italics has never been considered
part of plain text and has always been considered outside of it. The
fact that italics can be handled elsewhere very much weighs against
the value of your change. Everything you want to do can be done and is
being done, except when someone chooses not to do it.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Encoding italic

2019-01-30 Thread David Starner via Unicode
On Sun, Jan 27, 2019 at 12:04 PM James Kass via Unicode
 wrote:
> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text

Okay? Ed can do that too, along with nano and notepad. It's called
HTML (or TeX, or troff). If by plain text you mean self-interpreting,
without external standards, then it's simply impossible.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Unicode "no-op" Character?

2019-06-24 Thread David Starner via Unicode
On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode
 wrote:
> Which leads us to the key.  The desire is for a character that has no public 
> meaning, but has some sort of private meaning.  In other words it has a 
> private use.  Oddly enough, there is a group of characters intended for 
> private use, in the PUA ;-)

Whose private use? If you have a stream of data that is being
packetized for transmission, using a private-use character is likely to
mangle data that is being transmitted at some point. A NUL is likely
to be the best option, IMO, since it's unlikely that anyone expects
to be able to transmit a NUL through an arbitrary channel, unlike a
random private-use character.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: On the lack of a SQUARE TB glyph

2019-09-27 Thread David Starner via Unicode
On Thu, Sep 26, 2019 at 8:57 PM Fred Brennan via Unicode
 wrote:
> The purpose of Unicode is plaintext encoding, is it not? The square TB form is
> fundamentally no different than the square form of Reiwa, U+32FF ㋿, which was
> added in a hurry. The difference is that SQUARE TB's necessity and use is a
> slow thing which happened over years, not all of a sudden via one announcement
> of the Japanese government.

Defining whether a pair of characters gets squeezed into one square is
hardly a plaintext issue.

The square form of Reiwa is a bit different, given its use in printing
dates, where there may have been an expectation that it take up one
square. It's also a new member of a tiny set, as opposed to SQUARE TB,
which people have been using already in various ways.

> New emoji are still being encoded. The existence of SQUARE GB leads to its
> use, which then leads to people wanting SQUARE TB and resorting to hacks to
> get it done. If you didn't want people to request more square forms you
> shouldn't have encoded any at all. It's too late for that.

It's unlikely that not encoding them would have stopped the requests
from coming, and it's not too late for them to dismiss those requests.

Unicode, in order to become the one character set, had to become
backward compatible with all the major legacy character sets out
there. Unicode has piles and piles of frustrating compromises because
of that, but it was felt that was the cost that had to be paid.

> There is no sequence of glyphs that could be logically mapped, unless you're
> telling me to request that the sequence T  B be recommended for general
> interchange as SQUARE TB? That's silly.

Why is that silly? You've got an unbounded set of these; even the basic
prefixes EPTGMkhdmμnp (and da) crossed with the units bBmglWsAKNJCΩT
(plus a bunch more) come to over 200 combinations without all the
units, and some exponents are already encoded, so some of those would
need to be encoded with exponents too. And that's far from a complete
list of what people might want as squares.
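
The back-of-the-envelope count, as a sketch (my example, using only
the symbols named above):

    prefixes = list("EPTGMkhdmμnp") + ["da"]
    units = list("bBmglWsAKNJCΩT")
    print(len(prefixes) * len(units))  # 182; the extra units push it past 200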

-- 
Kie ekzistas vivo, ekzistas espero.



Re: comma ellipses

2019-10-07 Thread David Starner via Unicode
I still see the encoding of the original ellipsis as a mistake, probably
made for compatibility with some older standard that included it because
the system wasn't smart enough to handle "..." intelligently as an
ellipsis.

-- 
Kie ekzistas vivo, ekzistas espero.