Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Asmus Freytag via Unicode

  
  
On 2/12/2020 3:26 PM, Shawn Steele via
  Unicode wrote:


  
From the point of view of Unicode, it is simpler: If the character is in use or has had use, it should be included somehow.

  
  
That bar, to me, seems too low.  Many things are only used briefly or in a private context that doesn't really require encoding.

The term "use" clearly should be understood as "used in active
  public interchange".
From that point on, it gets tricky. Generally, standardizing
  something presupposes a community with shared, active
  conventions of usage. However, sometimes, what the community would
  like is to represent faithfully somebody's private convention, or
  some convention that's fallen out of use.
Such scenarios may require exceptions to the general statement,
  but the distinction between truly ephemeral use and use that,
  while limited in time, should be digitally archivable in plain
  text is, and always should be, a matter of judgment.


  

The hieroglyphs discussion is interesting because it presents them as living (in at least some sense) even though they're a historical script.  Apparently modern Egyptologists are coopting them for their own needs.  There are lots of emoji for professional fields.  In this case since hieroglyphs are pictorial, it seems they've blurred the lines between the script and emoji.  Given their field, I'd probably do the same thing.

Focusing on the community of scholars (and any other current
  users) rather than the historical community of original users
  seems rather the appropriate thing to do. Whenever a modern
  community uses a historic script, new conventions will emerge.
  These may even include conventions around transcribing existing
  documents (because the historic communities had no conventions
  around digitizing their canon).


  

I'm not opposed to the character if Egyptologists use it amongst themselves, though it does make me wonder whether it belongs in this set.  Are there other "modern" hieroglyphs?  (Other than the errors, etc. mentioned earlier; rather, glyphs that have been invented for modern use.)

I think the proposed location is totally fine. Trying to
  fine-tune a judgement about characters by placing them in a specific
  way is a fool's game. If needed, distinctions can be expressed via
  character properties.
A./




  

-Shawn 






  



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Asmus Freytag via Unicode

  
  
On 2/2/2020 5:22 PM, Richard Wordingham
  via Unicode wrote:


  On Sun, 2 Feb 2020 16:20:07 -0800
Eric Muller via Unicode  wrote:


  
That would imply some coordination among variations sequences on
different code points, right?

E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
ccc=0) would imply the existence of a variation sequence on 0B48 with
the same variation selector, and the same effect.

  
  
That particular case oughtn't to be impossible, as in NFD everything in
sight has ccc=0.  However TUS 12.0 Section 23.4 does contain an
additional prohibition against meaningfully applying a variation
selector to a 'canonical decomposable character'. (Scare quotes because
'ly' seems to be missing from the phrase.)

Richard.

So, let's look at what that would look like with some variation
  selector Fxxx:

<0B48, Fxxx> ≡ <0B47, 0B56, Fxxx>


If the variant in the shape of 0B48 is well described by a
  variation on the contribution due to 0B56 in the decomposed
  sequence, then this might make sense. But if the variant would be
  better described as a variation in the 0B47 component, then it
  would be a prime example of poor "pseudo-encoding": where some
  random sequence is assigned to a shape (in this case) without
  being properly analyzable into constituent characters with their
  own identity.
Which would it be in this example?
And this example only works, of course, because with ccc=0, 0B56
  cannot be reordered.
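As a concrete check, here is a minimal Python sketch using the standard
  unicodedata module (the variation selector itself is omitted, since no such
  sequence is actually defined for these characters):

    import unicodedata

    ai = "\u0B48"                                    # ORIYA VOWEL SIGN AI
    nfd = unicodedata.normalize("NFD", ai)

    # The canonical decomposition is <0B47, 0B56>.
    print([hex(ord(c)) for c in nfd])                # ['0xb47', '0xb56']

    # U+0B56 is a combining mark (Mn) with ccc=0, so canonical reordering
    # cannot move it away from U+0B47.
    print(unicodedata.category("\u0B56"))            # Mn
    print(unicodedata.combining("\u0B56"))           # 0

    # NFC recombines the pair, so a selector written after U+0B56 in the
    # decomposed form would end up following U+0B48 after normalization.
    print(hex(ord(unicodedata.normalize("NFC", nfd))))   # 0xb48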
The prohibition as worded may perhaps be slightly more broad than
  necessary, but I can understand that the UTC didn't want to parse
  it more finely in the absence of any good examples that could be
  used to better understand what the actual limitations should be.
  Better safe than sorry, and all that.

A./


  


  
On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:
I don't think there is a technical reason for disallowing variation
selectors after any starters (ccc=000); the normalization algorithm
doesn't care about the general category of characters.

Mark

  
  




  



Twitter corrects Kwanzaa emoji

2019-12-27 Thread Asmus Freytag via Unicode

  
  
https://thehill.com/policy/technology/476086-social-media-users-call-out-twitter-over-kwanzaa-emoji

  



Re: NBSP supposed to stretch, right?

2019-12-18 Thread Asmus Freytag via Unicode

  
  
On 12/17/2019 5:49 PM, James Kass via
  Unicode wrote:


  
  Asmus Freytag wrote,
  
  
  > And any recommendation that is not compatible with what the
  > overwhelming majority of software has been doing should be ignored
  > (or only enabled on explicit user input).
  >
  > Otherwise, you're just advocating for a massively breaking change.
  
  
  It seems like the recommendations are already in place and the
  “overwhelming majority of software” is already disregarding them.
  

so they are a dead letter and should be deprecated...

  
  I don’t see the massively breaking change here.  Are there any
  illustrations?
  
  
  If legacy text containing NON-BREAK SPACE characters is popped
  into a justifier, the worst thing that can happen is that the text
  will be correctly justified under a revised application.  That’s
  not breaking anything, it’s fixing it.  Unlike changing the
  font-face, font size, or page width (which often results in
  reformatting the text), the line breaks are calculated before
  justification occurs.
  
  
  If a string of NON-BREAK SPACE characters appears in an HTML file,
  the browser should proportionally adjust all of those space
  characters identically with the “normal” space characters.  This
  should preserve the authorial intent.
  
  
As for pre-Unicode usage of NON-BREAK SPACE, were there ever any
  explicit guidelines suggesting that the normal SPACE character
  should expand or contract for justification but that the NON-BREAK
  SPACE must not expand or contract?
  
  
  



  



Re: NBSP supposed to stretch, right?

2019-12-17 Thread Asmus Freytag via Unicode

  
  
On 12/17/2019 11:31 AM, James Kass via
  Unicode wrote:

So it
  follows that any justification operation should treat NO-BREAK
  SPACE and SPACE identically.
And any recommendation that is not
compatible with what the overwhelming majority of software has
been doing should be ignored (or only enabled on explicit user
input).
Otherwise, you're just advocating for a
massively breaking change.
NBSP has been supported since way before
Unicode. It's way past the point where we can legislate behavior
other than the de-facto consensus among implementations.
Now, if someone can show us that there are
widespread implementations that follow the above recommendation
and have no interoperability issues with HTML then I may change
my tune.
A./

  



Re: NBSP supposed to stretch, right?

2019-12-17 Thread Asmus Freytag via Unicode

  
  
On 12/17/2019 2:41 AM, Shriramana
  Sharma via Unicode wrote:


  
  

  
On Tue 17 Dec, 2019, 16:09
  QSJN 4 UKR via Unicode, 
  wrote:

Agree.
  By the way, it is common practice to use multiple nbsp in
  a row to
  create a larger span. In my opinion, it is wrong to
  replace fixed
  width spaces with non-breaking spaces.
  Quote from Microsoft Typography Character design
  standards:
  «The no-break space is not the same character as the
  figure space. The
  figure space is not a character defined in most computer
  system's
  current code pages. In some fonts this character's width
  has been
  defined as equal to the figure width. This is an incorrect
  usage of
  the character no-break space.»

  



Sorry but I don't understand how this addresses
  the issue I raised.
  

You don't?
In principle it may be true that NBSP is not
fixed width, but show me software that doesn't treat it that
way.
In HTML, NBSP isn't subject to space
collapse, therefore it's the go-to space character when you need
some extra spacing that doesn't disappear.
I bet, in many other environments it was
typically the only "other" space character, so it ended up
overloaded.
My hunch is that it is too late at this
point to try to promulgate a "clean" implementation of NBSP,
because it would effectively change untold documents
retroactively. So it would be a massively breaking change.
If you have a situation where you need
really poor layout (wide inter-word spaces) to justify, the fact
that an honorific in front of a name works more like it's part of
the same word (because the NBSP doesn't stretch) would be the
least of my worries. (Although, on lines where interword spaces
are reduced a bit, I can see that becoming counter-intuitive.)
If you only fix this in software for
high-end typography, you'd still have the issue that things will
behave differently if you export your (plain) text. And you
would have the issue of what to do when you want fixed spaces to
be non-breaking as well (is that ever needed?).
A./
  
  



Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Asmus Freytag via Unicode

  
  
On 11/19/2019 3:00 PM, Mark E. Shoulson
  via Unicode wrote:


  
  It says "foundation", not "sum total,
all there is."  I don't think this is much overreach.  MAYBE it
counts as "enthusiastic", but not misleading.
  
  
  Why so concerned with these minutiæ? 
Were you in fact misled?  (Doesn't sound like it.)  Do you know
someone who was, or whom you fear would be?  What incorrect
conclusions might they draw from that misunderstanding, and how
serious would they be?  Doesn't sound like this is really
anything serious even if you were right.



Anytime you need to stop and think: "can this be accurate?" you
  undermine the effectiveness of a message like that.
Amending the claim to limit it to "text", for example, would make
  it more directly applicable and therefore stronger.
A./


  
  
  ~mark
  
  
  
  On 11/19/19 1:59 PM, Costello, Roger
L. via Unicode wrote:
  
  




  Hi Folks,
   
  Today I received an email from the
Unicode organization. The email said this: (italics and
yellow highlighting are mine)
   
  The Unicode Standard is the 
foundation for all modern software and communications
around the world, including all modern operating
  systems, browsers, laptops, and smart phones—plus the
  Internet and Web (URLs, HTML, XML, CSS, JSON, etc.).
   
  That is a remarkable statement! But is it
entirely true? Isn’t it assuming that everything is text?
What about binary information such as JPEG, GIF, MPEG, WAV;
those are pretty core items to the Web, right? The Unicode
Standard is silent about them, right? Isn’t the above quote
a bit misleading?
   
  /Roger

  
  
  



  



Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Asmus Freytag via Unicode

  
  
On 11/19/2019 12:04 PM, Michael Everson
  via Unicode wrote:


  Of course it’s not “misleading”. Human language is best conveyed by text. 

One could insert the language in [ ] to make the claim sound less
  like an overreach.
It doesn't even impede the flow that much.
It would still apply to metadata and protocols.
A./


  

Michael Everson


  
On 19 Nov 2019, at 18:59, Costello, Roger L. via Unicode  wrote:

Hi Folks,
 
Today I received an email from the Unicode organization. The email said this: (italics and yellow highlighting are mine)
 
The Unicode Standard is the foundation for [handling written text in] all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones—plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.).
 
That is a remarkable statement! But is it entirely true? Isn’t it assuming that everything is text? What about binary information such as JPEG, GIF, MPEG, WAV; those are pretty core items to the Web, right? The Unicode Standard is silent about them, right? Isn’t the above quote a bit misleading?
 
/Roger

  
  






  



Re: New Public Review on QID emoji

2019-11-12 Thread Asmus Freytag via Unicode

  
  
On 11/12/2019 12:32 PM,
  wjgo_10...@btinternet.com via Unicode wrote:

>
  Just because you can write something that is a very detailed
  specification doesn't mean that it is, or ever should be, a
  standard.
  
  
  Yes, but that does not mean that it should necessarily not become
  a standard. For communication to take place one needs to start
  somewhere. The QID emoji proposal is a start. It has been
  considered at (at least) two Unicode Technical Committee meetings
  and now there is a public review taking place.

Just because a select group of people
engages in communication about the arcane details of a proposed
specification doesn't mean that the outcome will help some
entirely different and larger group communicate better.
There's too much of the "might possibly"
about this; and it is quite different from the early days of
Unicode itself, where there was a groundswell of pent-up demand
for a solution to the fragmented character encoding landscape;
the discussions quickly became about the best way to do that,
and about how to ensure that the result would be supported.
The current effort starts from an unrelated
problem (Unicode not wanting to administer emoji applications)
and in my analysis, seriously puts the cart before the horse.
A./

  



Re: New Public Review on QID emoji

2019-11-12 Thread Asmus Freytag via Unicode

  
  
On 11/12/2019 8:41 AM,
  wjgo_10...@btinternet.com via Unicode wrote:

Asmus
  Freytag wrote as follows.
  
  
  While I have a certain understanding for
the underlying concerns, it still is the case that this proposal
promises to be a bad example of "leading standardization":
throwing out a spec in the hopes it may be taken up and take
off, instead of something that meets an expressed need of the
stakeholders and that they are eagerly awaiting.

  
  
  I suppose that it could be called "leading standardization" but I
  think that that is a good thing. Unicode has traditionally been
  locked into the past. If a symbol could be found carved in stone
years ago then that was fine, but anything for the future that
  could possibly become useful was a huge insuperable problem.
  
  
  Yet for me "could possibly become useful" is a good reason for
  encoding, and QID emoji opens up great futuristic possibilities.
For me the big problem with the proposal at present is the
  restrictions upon which QID items are valid to become encoded as
  QID emoji. So anything abstract is locked out. That to me is an
  unnecessary restriction, yet it could easily be removed. Yet
  abstract shapes are important in communication.
  

If leading standardization was such a good thing in
  communication, why don't we see more "dictionaries of words not
  yet in use"? After all, it would be a huge benefit for people
  coining new terms to have their definitions already worked out.
  Nothing inherent in the technology of dictionaries has directly
  prevented overtures in that direction, but it overwhelmingly
  remains a path not taken.
One wonders why.




  
  I regard QID emoji as a research project. The specification may
  need some alterations, maybe it is just the start of a whole new
  path of exploration in communication, much wider than emoji. I am
  a researcher and I try to find what is good in an idea and focus
  on that and think where a new idea can lead, applying critical
  consideration of ideas, yet trying to move forward rather than
  seizing on problems found as a reason for dismissing the whole
  idea. So find the problems, try to think round them, try to go
  forward. Look for what could be done and if it is good, try to do
  it. Try to go forward rather than quash.
  



Research and standardization are both worthwhile endeavors, but
  they are fundamentally different in outlook. Standardization is
  about a community agreeing on a fixed common way of doing
  something. It inherently "squashes" other alternative ways of
  doing the same thing, in the interest of gaining the efficiencies
  inherent in having a single approach; even where it's
  theoretically not the best.




  
  That, then, finally undermines Unicode's
implied guarantee as being the medium for unambiguous
interchange. Giving up that guarantee seems a bad bargain.

  
  
  Many recent emoji encoding proposals seem to delight, as if
  required, in providing multiple meanings for each newly proposed
  character.
  
  
  There was a talk at the Unicode and Internationalization
  Conference a few years ago on what are the meanings of emoji. I
  was not there but there is a video available on YouTube.
  

Emoji, just like words, are amenable to idiomatic use. Such
  idiomatic use will always be at odds with whatever formal meaning
  (or dictionary definition) is associated with a character, and
  being such a fundamental aspect of how language functions, it's
  unlikely to be a passing phenomenon. I'm not sure that the people
  administering the standard have fully woken up to what that means,
  or what is required of emoji so that they can best function when
  used in this way.
Linking emoji to an open-ended set of supposedly well-defined
  semantic signifiers is simply adding a bigger dictionary, while
  removing the one key aspect of what written communication depends
  on: the guaranteed status of the written symbol as being in the
  shared canon of writing system elements, and one that is
  recognized by the recipients as such.
Having a huge set of potential semantic values that are unrelated
  to a specific shape and not guaranteed to be shared exacerbates
  the existing problems, rather than pointing to a fix. In that
  sense, the proposed approach is truly a solution in search of a
  question.
Just because you can write something that is a very detailed
  specification doesn't mean that it is, or ever should be, a standard.

Re: New Public Review on QID emoji

2019-11-09 Thread Asmus Freytag via Unicode

  
  
On 11/9/2019 3:18 PM, Peter Constable
  via Unicode wrote:


  Neither Unicode Inc. or ISO/IEC 10646 would _implement_ QID emoji. Unicode would provide a specification for QID emoji that software vendors could implement, while ISO/IEC 10646 would not define that specification. As Ken mentions, there are already many emoji in use inter-operably based on specifications provided by Unicode but not by ISO/IEC 10646.

One of the bigger issues I have with this
proposal is that it is a specification that "vendors _could_
implement". 
  
Let's not argue why that might technically
apply to other specifications; in this case it underscores the
fact that it is not the vendors that are asking for this, but
instead, the motivation appears to be the Unicode Consortium's
interest to not be seen as "arbiters" of new emoji.
While I have a certain understanding for the
underlying concerns, it still is the case that this proposal
promises to be a bad example of "leading standardization":
throwing out a spec in the hopes it may be taken up and take
off, instead of something that meets an expressed need of the
stakeholders and that they are eagerly awaiting.
The stakeholders in this case are not
limited to the vendors. They include the users as well. Having a
negotiated set of emoji, implies, on the one hand, a limitation.
But on the other hand it allows the necessary standardization of
the set; such that users can be comfortable in the expectation
that there is wide agreement in the set of supported emoji, so
that they are safe to use in interchange across platforms.
While there is considerable latitude in
representation, some emoji are beginning to be used
idiomatically, where they are not standing for the formally
adopted meaning, but often for something that is only implied
(or hinted at by common choice for the particular depiction of
an unrelated concept). Having a central clearinghouse provides a
platform that allows some push for commonality in depiction,
particularly in instances where it matters for idiomatic and
other common uses.
From the idiomatic use that certain emoji
have acquired, it follows that a simple "semantic"
identification is a solution in search of a problem: being able
to, in principle, differentiate between aubergine and eggplant
isn't of interest, given the way the eggplant emoji is commonly
used. Being able to identify a specific semantic to something
some vendor puts up on a keyboard, is also not as interesting as
the association that one can expect the recipient will make with
the displayed shape. And finally, the less likely it will be
that some carefully selected emoji will be received as such by
its intended recipient the less valuable it will be to the user.
  
Much of the strong interest in pushing for
the adoption of particular emoji is derived from the feeling of
validation of having a representative depiction or
representative object admitted to the 'canon'. Stakeholders for
which that is important aren't going to respond as well to a
non-committal, free-for-all approach.
While QID is a fancy label, de-facto these
would be "private-use" characters: on any given platform, some
set of QID emoji may be defined on the keyboard/palette and, as a
user, I may be assured that inserting them for platform-internal
use will work. The platform vendor may gain the benefit
of being able to use an "in-stream" encoding, but if they
already supported private emoji, they will have a legacy scheme
for them that they may or may not be able to abandon. For
cross-platform use (or cut and paste) all bets are off: the
chances that out of millions of QIDs two vendors will support
the same one (even for overlapping concepts) are going to be
rather small. Lacking central endorsement, the extensions for
different vendors can be expected to drift apart to the point
where QIDs become useless for wider interchange.
But what if there were a consortium of
vendors, you say, who could coordinate these efforts? Well, gee,
I could think of a very successful consortium of vendors . . .
As for the underlying motivation of getting
out of the emoji business, I would say that emoji business has
had some not inconsiderable positive side effects. While much of
Unicode's work on natural scripts has been focused on writing
systems without a large contemporary user base, emoji have
represented a high-interest, common-use, and high-visibility
subset - and have probably done more than any 

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode

  
  
On 10/13/2019 6:38 PM, Richard
  Wordingham via Unicode wrote:


  On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode  wrote:


  
On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not .  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
instead of formally handling NFD, you could extend the syntax to
handle "inherited" properties across combining sequences.

Am I missing anything?

  
  
Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match . 

Why does it matter if it is precomposed? Why should it? (For
  anyone other than a character coding maven).


   Now, I could invent a string property
\p{xLu} that meant (?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both  and its canonical
equivalent .  The canonical closure of that
sequence can be messy even within scripts.  Some pairs commute: others
don't, usually for good reasons.


Some models may be more natural for different scripts. Certainly,
  in SEA or Indic scripts, most combining marks are not best modeled
  with properties as "inherited". But for L/G/C etc. it would be a
  different matter.
For general recommendations, such as UTS#18, it would be good to
  move the state of the art so that the "primitives" are in line
  with the way typical writing systems behave, so that people can
  write "linguistically correct" regexes.
A./
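A small illustration of why that closure gets messy (a Python sketch using
  the standard unicodedata module): canonical reordering means the same marks
  can be stored in more than one canonically equivalent order, and a regex
  that respects canonical equivalence has to accept all of them.

    import unicodedata

    # e + COMBINING ACUTE ACCENT (ccc=230) + COMBINING DOT BELOW (ccc=220)
    s = "e\u0301\u0323"
    t = unicodedata.normalize("NFD", s)

    # Canonical reordering swaps the two marks (lower ccc first), so the two
    # orders are equivalent and both must be matched. Marks with ccc=0, as in
    # many Indic scripts, never reorder, which is where pairs stop commuting.
    print([hex(ord(c)) for c in s])   # ['0x65', '0x301', '0x323']
    print([hex(ord(c)) for c in t])   # ['0x65', '0x323', '0x301']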




  
Regards,

Richard.





  



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode

  
  
On 10/13/2019 2:54 PM, Richard
  Wordingham via Unicode wrote:


  Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not .  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Formally, wouldn't that be rewriting \p{Lu}
to match \p{Lu}\p{Mn}*; instead of formally handling NFD, you
could extend the syntax to handle "inherited" properties across
combining sequences.
Am I missing anything?
A./
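A minimal sketch of that rewriting, assuming Python's third-party regex
  module (unlike the standard re module, it supports \p{...} property escapes):

    import regex
    import unicodedata

    # There is no precomposed LATIN CAPITAL LETTER M WITH CIRCUMFLEX, so the
    # sequence stays <U+004D, U+0302> in both NFC and NFD.
    m_circumflex = "M\u0302"

    # Plain \p{Lu} matches only the base letter and strands the mark:
    print(regex.findall(r"\p{Lu}", m_circumflex))          # ['M']

    # Rewritten to cover a whole combining sequence:
    print(regex.findall(r"\p{Lu}\p{Mn}*", m_circumflex))   # ['M̂']

    # For U+0100 (A WITH MACRON), normalization decides how many code points
    # the same pattern consumes, but both forms are matched:
    nfc = "\u0100"
    nfd = unicodedata.normalize("NFD", nfc)                # <A, U+0304>
    print(regex.findall(r"\p{Lu}\p{Mn}*", nfc))            # ['Ā']
    print(regex.findall(r"\p{Lu}\p{Mn}*", nfd))            # ['Ā'] (decomposed)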
  
  



Re: Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji))

2019-10-12 Thread Asmus Freytag via Unicode

  
  
On 10/12/2019 1:16 AM, Daniel Bünzli
  via Unicode wrote:


  With all due respect for the work that has been done on the new website I think that the new structure significantly decreased the usability of the website for technical users.

^^^  This  (unfortunately).
A./

  



Re: Fwd: The Most Frequent Emoji

2019-10-11 Thread Asmus Freytag via Unicode

  
  
Sidebar looks same as on other pages
  for me. Don't like the design, but that's a different issue.


Now, stuff on the bottom: the line with
  the "terms of use" is at least one font size too small. Esp. if
  the terms of use are supposed to be a clickable link.



You have to have pretty good eyesight
  to be able to read it - not sure how well that plays for required
  legal language.


I think that should be looked into and
  fixed  - but it's systemic.



A./



On 10/10/2019 11:46 PM, Martin J. Dürst
  via Unicode wrote:


  I had a look at the page with the frequencies. Many emoji didn't 
display, but that's my browser's problem. What was worse was that the 
sidebar and the stuff at the bottom was all looking weird. I hope this 
can be fixed.

Regards,   Martin.

 Forwarded Message 
Subject: The Most Frequent Emoji
Date: Wed, 09 Oct 2019 07:56:37 -0700
From: announceme...@unicode.org
Reply-To: r...@unicode.org
To: announceme...@unicode.org

[Emoji Frequency Image] How does the Unicode Consortium choose which new 
emoji to add? One important factor is data about how frequently the 
current emoji are used. Patterns of usage help to inform decisions about 
future emoji. The Consortium has been working to assemble this 
information and make it available to the public.

And the two most frequently used emoji in the world are...
 and ❤️
The new Unicode Emoji Frequency 
 page shows a list of 
the Unicode v12.0 emoji ranked in order of how frequently they are used.

“The forecasted frequency of use is a key factor in determining whether 
to encode new emoji, and for that it is important to know the frequency 
of use of existing emoji,” said Mark Davis, President of the Unicode 
Consortium. “Understanding how frequently emoji are used helps 
prioritize which categories to focus on and which emoji to add to the 
Standard.”


/Over 136,000 characters are available for adoption 
, to help the 
Unicode Consortium’s work on digitally disadvantaged languages./


http://blog.unicode.org/2019/10/the-most-frequent-emoji.html






  



Re: comma ellipses

2019-10-07 Thread Asmus Freytag via Unicode

  
  
On 10/6/2019 10:59 PM, David Starner
  via Unicode wrote:


  I still see the encoding of the original ellipsis as a mistake,
probably for compatibility with some older standard that included it
because the system wasn't smart enough to intelligently handle "..."
as ellipsis.



Agreed, a big part was "fixed width" fonts,
but the Asian variety where it may also have been baked into the
layout. However, now that the code point exists, it has been
integrated into the way fonts and applications handle layout.
Word, for example, appears to apply
auto-correct (or does in the older version running on the
machine I'm typing this on).
The point is, whatever the situation was in
the late 1980s that led to the inclusion in Unicode in the
first place isn't (can't be) the last word in defining this
character: Unicode isn't merely passively modeling, but via
users and implementers there's a feedback loop.
The practice seems to be that if you want a
typographically sound ellipsis you may key in three periods, but
what is stored is the code point for the ellipsis (and the
layout for "random" three periods is not adjusted). In any
applications that do not support that level of input support,
you get a typographically not perfect representation.
That's actually not as bad as it sounds,
because periods are so heavily overloaded that you'd want to be
a bit careful assuming (without user override) that three of
them are a true "ellipsis".
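A toy sketch of that kind of input-support mapping (hypothetical; real
autocorrect, such as Word's, is context-sensitive and can be undone by the
user):

    import re

    ELLIPSIS = "\u2026"   # HORIZONTAL ELLIPSIS

    def autocorrect_ellipsis(text: str) -> str:
        # Replace a run of exactly three periods; longer runs are left
        # alone, since periods are heavily overloaded (leader dots, etc.).
        return re.sub(r"(?<!\.)\.\.\.(?!\.)", ELLIPSIS, text)

    print(autocorrect_ellipsis("Wait for it..."))   # Wait for it…
    print(autocorrect_ellipsis("....."))            # unchanged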
  
If there's no "typographically correct" form
for a "comma ellipsis" then there's no difference ever between
three of them and a comma ellipsis, and all further discussion
is moot. However, assume there's an assertion that three commas
need to be spaced differently if they are intended as a
typographically correctly rendered comma ellipsis.
Asking for software to handle that on the
fly (without the kind of override option provided by
auto-correct or other input support mapping this to an ellipsis
code point) would be wrong. One, because it assumes three commas
can never be anything else than a "comma ellipsis", and two,
because it would introduce a requirement that's at odds with how
implementers (or at least an significant portion) have chosen to
treat the 3-dot ellipsis.
There's even an argument that the whole
thing is on par with input support resolving two hyphens into an
en-dash and three into an em-dash, but making that subject to
user override (via mapping to dedicated code points) and not
simply by asserting special on-the-fly formatting.
(I also see little risk that there's a huge
set of other multiple-punctuation sequences out there that could
make a legitimate claim to be encoded, so treating ellipsis as a
precedent does not promise to eat up code space by the
plane-load).
  
A./

  



Re: comma ellipses

2019-10-07 Thread Asmus Freytag (c) via Unicode
Now you are introducing research - that kills all the fun . . . (oops , 
, , )

A./

On 10/6/2019 10:39 PM, Tex wrote:


Just for additional info on the subject:

https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book

“…I’ve been spending a fair bit of time recently with the comma 
ellipsis, which is three commas (,,,) instead of dot-dot-dot. I’ve 
been looking at it for over a year and I’m still figuring out what’s 
going on there. There seems to be something but possibly several 
somethings.


One use is by older people who, in some cases where they would use the 
classic ellipsis, use commas instead. It’s not quite clear if that’s a 
typo in some cases, but it seems to be more systematic than that. 
Maybe they’re preferring the comma because it’s a little bit easier to 
see if you’re on the older side, and your vision is not what it once 
was. Or maybe they just see the two as equivalent. It then seems to 
have jumped the shark into parody form. There’s a Facebook group in 
which younger people pretend to be to be baby boomers, and one of the 
features people use there is this comma ellipsis. And then in some 
circles there also seems to be a use of comma ellipses that is very, 
very heavily ironic. But what exactly the nature is of that heavy 
irony is still something that I’m working on figuring out….”


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Sunday, October 6, 2019 10:21 PM
*To:* unicode@unicode.org
*Subject:* Re: comma ellipses

On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote:

It’s deliberately incorrect for humorous effect. It gets used, but
making it “official” would almost defeat the purpose.

Well then it should encode a "typographically incorrect" comma ellipsis :)

A./

On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode
mailto:unicode@unicode.org>> wrote:

On 10/6/2019 4:05 PM, Tex via Unicode wrote:

Now that comma ellipses (,,,) are a thing (at least on
social media) do we need a character proposal?

Asking for a friend,,, J

tex

I thought the main reason we ended up with the period (dot)
one is because it was originally needed for CJK-style fixed
grid layout purposes. But I could be wrong.

What's the current status for 3-dot ellipsis. Does it get
used? Do we have autocorrect for it? If so, that would argue
that implementers have settled and any derivative usage
(comma) should be kept compatible.

A./





Re: comma ellipses

2019-10-06 Thread Asmus Freytag via Unicode

  
  
On 10/6/2019 8:21 PM, Garth Wallace via
  Unicode wrote:


  
  
It’s deliberately incorrect for humorous effect.
  It gets used, but making it “official” would almost defeat the
  purpose.
  

Well then it should encode a "typographically incorrect" comma
  ellipsis :)
A./


  

  On Sun, Oct 6, 2019 at 5:02
    PM Asmus Freytag via Unicode <unicode@unicode.org>
wrote:
  
  

  On 10/6/2019 4:05 PM, Tex via Unicode wrote:
  
  

  Now that comma ellipses (,,,) are
a thing (at least on social media) do we need a
character proposal?
   
  Asking for a friend,,, J
   
  tex

  


  I thought the main reason we ended
  up with the period (dot) one is because it was
  originally needed for CJK-style fixed grid layout
purposes. But I could be wrong.
  What's the current status for
  3-dot ellipsis. Does it get used? Do we have
  autocorrect for it? If so, that would argue that
  implementers have settled and any derivative usage
  (comma) should be kept compatible.


  

  A./
  

  

  



  



Re: comma ellipses

2019-10-06 Thread Asmus Freytag via Unicode

  
  
On 10/6/2019 4:05 PM, Tex via Unicode
  wrote:


  
  
  
  
Now that comma ellipses (,,,) are a thing
  (at least on social media) do we need a character proposal?
 
Asking for a friend,,, J
 
tex
  

I thought the main reason we ended up with
the period (dot) one is because it was originally needed for
CJK-style fixed grid layout purposes. But I could be wrong.
What's the current status for 3-dot
ellipsis. Does it get used? Do we have autocorrect for it? If
so, that would argue that implementers have settled and any
derivative usage (comma) should be kept compatible.
  
A./

  



Re: Alternative encodings for Malayalam “nta”

2019-10-06 Thread Asmus Freytag (c) via Unicode

On 10/6/2019 11:57 AM, 梁海 Liang Hai wrote:

Folks,

(Microsoft Peter and Andrew, search for “Windows” in the document.)

(Asmus, in the document there’s a section 5, /ICANN RZ-LGR 
situation/—let me know if there’s some news.)


The issue, as it affects domain names, has been brought to the authors 
of the Malayalam Root Zone LGR proposal, the Neo-Brahmi Generation 
Panel; however, there is no new status to report at this time. I would 
appreciate if you could keep me updated on any details of the UTC 
decision (particularly those that do not make the rather terse UTC minutes).


A./




This is a pretty straightforward document about the notoriously 
problematic encoding of Malayalam /rra/>. I always wanted to properly document this, so finally here it is:


L2/19-345

*Alternative encodings for Malayalam "nta"*
Liang Hai
2019-10-06


Unfortunately, as  has already become the de facto 
standard encoding, now we have to recognize it in the Core Spec. It’s 
a bit like another Tamil /srī/ situation.


An excerpt of the proposal:

Document the following widely used encoding in
the Core Specification as an alternative representation for
Malayalam [glyph] () that is a
special case and does not suggest any productive rule in the
encoding model:




Best,
梁海 Liang Hai
https://lianghai.github.io





Re: Alternative encodings for Malayalam “nta”

2019-10-06 Thread Asmus Freytag (c) via Unicode

Have you submitted that response as a UTC document?
A./

On 10/6/2019 2:08 PM, Cibu wrote:
Thanks for addressing this. Here is my response: 
https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/


In summary, my take is:

The sequence  for ൻ്റ (<>) 
should not be legitimized as an alternate encoding; but should be 
recognized as a prevailing non-standard legacy encoding.



On Sun, Oct 6, 2019 at 7:57 PM 梁海 Liang Hai > wrote:


Folks,

(Microsoft Peter and Andrew, search for “Windows” in the document.)

(Asmus, in the document there’s a section 5, /ICANN RZ-LGR
situation/—let me know if there’s some news.)

This is a pretty straightforward document about the notoriously
problematic encoding of Malayalam . I always wanted to properly document this, so finally here
it is:

L2/19-345

*Alternative encodings for Malayalam "nta"*
Liang Hai
2019-10-06


Unfortunately, as  has already become the de
facto standard encoding, now we have to recognize it in the Core
Spec. It’s a bit like another Tamil /srī/ situation.

An excerpt of the proposal:

Document the following widely used encoding in
the Core Specification as an alternative representation for
Malayalam [glyph] () that
is a special case and does not suggest any productive rule in
the encoding model:




Best,
梁海 Liang Hai
https://lianghai.github.io





Re: On the lack of a SQUARE TB glyph

2019-09-30 Thread Asmus Freytag via Unicode

  
  
On 9/30/2019 1:01 AM, Andre Schappo via
  Unicode wrote:


  

  
On Sep 27, 1 Reiwa, at 08:17, Julian Bradfield via Unicode  wrote:

Or one could allow IDS to have leaf components that are any
characters, not just ideographic characters, and then one could have
all sorts of fun.

  
  
I do like this idea.

Note: This is a modified repost as I previously forgot to credit Julian as the originator

André Schappo




And to keep my previous reply in context: I think the "all sorts
  of fun" would be the wrong reason to do things. However, things
  like squared abbreviations and squared kana words all occur in the
  context of typesetting text containing ideographs. Therefore,
  extending the IDS slightly, so that it can cover those use cases,
  would make a certain amount of sense. While the result being
  composed (or "described") wouldn't be an actual Han ideograph, it
  would nevertheless function like one typographically.
That makes that suggestion a rather appropriate alternative for
  things like *SQUARE TB.
The kinds of fonts that might have a mapping from some IDS to a
  single glyph might also have glyphs that correspond to popular
  squared abbreviations. 

And the way the components are stacked is at least broadly
  similar to (or better, a subset of) the ways ideographic components
  can be stacked. One might start out by disallowing things like the
  surround operators in favor of simply doing things like "two up"
  and "side by side" for starters.
In other words, not "all sorts of fun" but something targeted at
  precisely what is needed for the extension of the frozen subset of
  abbreviations, so that they can occur in contexts that do not allow
  full markup languages, without having to be precomposed.
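Purely as a hypothetical sketch of what such an extended description
  sequence might look like (current IDS syntax does not allow Latin letters
  as leaf components, so this is illustrative only):

    # Existing IDS operators, applied - hypothetically - to Latin leaves.
    IDC_ABOVE_TO_BELOW = "\u2FF1"   # ⿱ IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
    IDC_LEFT_TO_RIGHT  = "\u2FF0"   # ⿰ IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

    square_tb_two_up       = IDC_ABOVE_TO_BELOW + "TB"   # "T" stacked over "B"
    square_tb_side_by_side = IDC_LEFT_TO_RIGHT + "TB"    # "T" next to "B"

    print(square_tb_two_up, square_tb_side_by_side)      # ⿱TB ⿰TB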

A./





  



Re: On the lack of a SQUARE TB glyph

2019-09-29 Thread Asmus Freytag via Unicode

  
  
On 9/29/2019 7:42 AM, Andre Schappo via
  Unicode wrote:


  

  
Or one could allow IDS to have leaf components that are any
characters, not just ideographic characters, and then one could have
all sorts of fun.

  
  
I do like that idea

André Schappo







That could be an appropriate extension for
the kind of "grouping characters in an ideographic cell" that
applies to examples like the *SQUARE TB.
The outcome of that process is by definition
compatible for use with ideographs and not an open ended
compositional scheme. 
  
A./

  



Re: Proposing mostly invisible characters

2019-09-13 Thread Asmus Freytag via Unicode

  
  
On 9/13/2019 10:50 AM, Richard
  Wordingham via Unicode wrote:


  On Fri, 13 Sep 2019 08:56:02 +0300
Henri Sivonen via Unicode  wrote:


  
On Thu, Sep 12, 2019, 15:53 Christoph Päper via Unicode
 wrote:



  ISHY/SIHY is especially useful for encoding (German) noun compounds
in wrapped titles, e.g. on product labeling, where hyphens are often
suppressed for stylistic reasons, e.g. orthographically correct
_Spargelsuppe_, _Spargel-Suppe_ (U+002D) or _Spargel‐Suppe_
(U+2010) may be rendered as _Spargel␤Suppe_ and could then be
encoded as _SpargelSuppe_.
 



Why should this stylistic decision be encoded in the text content as
opposed to being a policy applies on the CSS (or conceptually
equivalent) layer?

  
  
How would you define such a property?

Richard.





We should start with whether such a
stylistic choice is general enough so that support in one or the
other standard is indicated.
Color me "not convinced" on that point.
If product names (or descriptions) are
wrapped in non-standard ways on products and in advertising, that
may well be common in those instances, but such uses are like signage
and not running text. The designer will either use two text
boxes or use a fixed-size one and insert a space to get the
(typo-)graphical appearance desired.
Short of seeing this in a block of text on a
website where that block is resized with screen size or
resolution, I think we are arguing far ahead of an actual use
case.
  
A./
  


  



Re: Proposing mostly invisible characters

2019-09-13 Thread Asmus Freytag via Unicode

  
  
On 9/12/2019 5:53 AM, Christoph Päper
  via Unicode wrote:


  ISHY/SIHY is especially useful for encoding (German) noun compounds in wrapped titles, e.g. on product labeling, where hyphens are often suppressed for stylistic reasons, e.g. orthographically correct _Spargelsuppe_, _Spargel-Suppe_ (U+002D) or _Spargel‐Suppe_ (U+2010) may be rendered as _Spargel␤Suppe_ and could then be encoded as _SpargelSuppe_.

Can you provide examples where this happens
in text that is not fixed layout, that is, a product website,
rather than a product label? For fixed layout, you cannot, in
principle, know that there wasn't a regular space used (or two
separate text boxes, or any other means to get the effect). 
  
A./

  



Re: PUA (BMP) planned characters HTML tables

2019-08-15 Thread Asmus Freytag via Unicode

  
  
On 8/14/2019 7:49 PM, James Kass via
  Unicode wrote:


  
  On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:
  
  Empirically, it has been observed that some distinctions that are
  claimed by users, standards developers or implementers were de-facto
  not honored by type developers (and users selecting fonts) as long as
  the native text doesn't contain minimal pairs.

  
  
  Quickly checked a couple of older on-line PDFs and both used the
  comma below unabashedly.
  
  
  Quoting from this page (which appears to be more modern than the
  PDFs),
  
  http://www.trussel2.com/MOD/peloktxt.htm
  
  
  "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo
  juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne
  depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab
  kanooj ememej. Wa in ṃōṃkaj kar ..."
  
  
  It seems that users are happy to employ a dot below in lieu of
  either a comma or cedilla.  This newer web page is from a book
  published in 1978.  There's a scan of the original book cover.
  Although the book title is all caps hand printing it appears that
  commas were used.  The Marshallese orthography which uses
  commas/cedillas is fairly recent, replacing an older scheme
  devised by missionaries.  Perhaps the actual users have already
  resolved this dilemma by simply using dots below.
  
  
  

That may be the case for Marshallese. It
wouldn't surprise me.
  
My comments were based on a different case
of the same kinds of diacritics below (other languages) and at
the time we consulted typographic samples including newsprint
that were using pre-Unicode technologies. In that sense a
cleaner case, because there was no influence by what Unicode did
or didn't do.
Now, having said that, I do get it that some
materials, like text books, online class materials etc. need to
be prepared / printed using the normative style for the given
orthography.
But it's a far cry from claiming that all
text in a given language is invariably done only one way.
A./
  
  



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Asmus Freytag via Unicode

  
  
On 8/14/2019 2:05 AM, James Kass via
  Unicode wrote:

This
  presumes that the premise of user communities feeling strongly
  about the unacceptable aspect of the variants is valid.  Since it
  has been reported and nothing seems to be happening, perhaps the
  casual users aren't terribly concerned.  It's also possible that
  the various user communities have already set up their systems to
  handle things acceptably by installing appropriate fonts.

This is always a good question.
Empirically, it has been observed that some
distinctions that are claimed by users, standards developers or
implementers were de-facto not honored by type developers (and
users selecting fonts) as long as the native text doesn't
contain minimal pairs.
For example, some Latin fonts drop the dot
on the lowercase i for stylistic reasons (or designers use
dotless i in highly designed texts, like book covers, logos,
etc.). That's usually not a problem for ordinary users for
monolingual texts in, say English; even though everyone agrees
that the lowercase i is normally dotted, the absence isn't
noticed by most, and tolerated even by those who do notice it.
However, as soon as a user community sees a
particular variant as signalling their group identity, they will
be very vocal about it - even, interestingly enough, in cases
where de-facto use (e.g. via font selection, and not forced by
implementation defaults) doesn't match that preference. As I
said, we've seen this in the past for some features in some
languages.
Now, which features become strongly
identified with group identity is something that is subject to
change over time; this makes it impossible to guarantee both
absolute stability and perfect compatibility, especially if a
combining mark that is used in decompositions needs to be
disunified because the range of shapes changes from being
stylistic to normative.
Before Unicode, with character sets limited
to local use, you couldn't create minimal pairs (except if the
variation was part of your language, like Turkish i with/without
dot). So, if a font deviated and pushed the stylistic envelope,
the non-preferred form, if used, would still necessarily refer
to the local character; there was no way it could mean anything
else. With Unicode, that's changed, and instead of user
communities treating this as a typographic issue (exclusive use
of preferred font) which is decentralized to document authors
(and perhaps font vendors) it becomes a character coding issue
that is highly visible and centralized.
That in turn can lead to the issue becoming
politicized; and not unlike some grammar issues, where the
supposedly "correct" form is far from universally agreed on in
practice.
A./
  
  



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Asmus Freytag via Unicode

  
  
On 8/8/2019 1:06 AM, Richard Wordingham
  via Unicode wrote:


  This is not compliant with Unicode, but
neither is deliberately treating canonically equivalent forms
differently.

That.
A./

  



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag via Unicode

  
  
On 8/7/2019 5:33 PM, Andrew Glass via
  Unicode wrote:


  
  
  
  
I agree and understand that accurate
representation is important in this case. It would be good
to understand how widespread the issue is in order to begin
to justify the work to retrofit shaping with normalization.
The number of problematic strings may be small but the risk
of regression in this case might be quite large.
  

Not sure how to quantify this. Potentially every URL (assuming
  that local users eventually migrate to non-ASCII domains). Then
  again, not all of these will be normalized in the document. 

I don't know the precise behavior of address bar / status bar. I
  know that when you type in an uppercase ASCII domain name, it will
  resolve, but the lower case name is echoed.
Can't tell immediately whether that means that for names that are
  normalized for lookup, you also get the canonical name displayed.
  If so, then every single (local) URL in those scripts is
  potentially affected.
A./




  

 
Cheers,
 
Andrew
 

  
From: Asmus Freytag (c)


Sent: 07 August 2019 17:17
To: Andrew Glass
; Unicode Mailing List

Subject: Re: What is the time frame for USE
shapers to provide support for CV+C ?
  

 

  On 8/7/2019 5:08 PM, Andrew Glass wrote:


  Shaping domain names is a new requirement. It
  would be good to understand the specific cases that are
  falling in the gap here.

Domain names are simply strings, but the protocol enforces
  normalization to NFC. In some situations, it might be possible
  for a browser, for example, to have access to the
  user-provided string, but I can see any number of situations
  where the actual string (as stored in the DNS) would need to
  be displayed.
For the scenario, it does not matter whether it's NFC or NFD,
  what matters is that some particular un-normalized state would
  be lost; and therefore it would be bad if the result is that
  the string can no longer be rendered correctly.
In particular, as the strings in question would be
  identifiers, where accurate recognition is prime.
A./
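A minimal sketch of the scenario (Python, standard library only): IDNA
  processing normalizes labels to NFC, so the stored form that later gets
  displayed may not be the code point sequence the user typed.

    import unicodedata

    typed  = "cafe\u0301.example"                   # "cafe" + COMBINING ACUTE ACCENT
    stored = unicodedata.normalize("NFC", typed)    # label becomes "café" with U+00E9

    print([hex(ord(c)) for c in typed.split(".")[0]])
    print([hex(ord(c)) for c in stored.split(".")[0]])

    # For Latin this is harmless; the concern above is scripts where a shaper
    # that ignores normalization may no longer render the NFC form correctly.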

   
  

  From: Unicode
  
  
On Behalf Of Asmus Freytag via Unicode
  Sent: 07 August 2019 14:19
  To: unicode@unicode.org
  Subject: Re: What is the time frame for USE
  shapers to provide support for CV+C ?

  
   
  
What about text that must exist
  normalized for other purposes?
  
  
 
  
  
Domain names must be normalized to NFC,
  for example. Will such strings display correctly if passed
  to USE?
  
  
 
  
  
A./
  
  
 
  
  
On 8/7/2019 1:39 PM, Andrew Glass via
  Unicode wrote:
  
  
That's correct, the Microsoft implementation of USE spec does not normalize as part of the shaping process.
Why? Because the ccc system for non-Latin scripts is not a good mechanism for handling complex requirements for these writing systems and the effects of ccc-based normalization can disrupt authors intent. Unfortunately, because we cannot fix ccc values, shaping engines at Microsoft have ignored them. Therefore, recommendation for passing text to USE is to not normalize.
 
By the way, at the current time, I do not have a final consensus from Tai Tham experts and community on the changes required to support Tai Tham in USE. Therefore, I've not been able to make the changes proposed in this thread.
 
Cheers,
 
Andrew
 
-Original Message-
From: Richard Wordingham  
Sent: 07 August 2019 13:29
To: Richard Wordingham via Unicode 
Cc: Andrew Glass 
Subject: Re: What is the time frame for USE shapers to provide support for CV+C ?
 
On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode  wrote:
 

  On Tue, 14 May 2019 00:58:07 +
  Andrew Glass via Unicode  wrote

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag (c) via Unicode

On 8/7/2019 5:08 PM, Andrew Glass wrote:


Shaping domain names is a new requirement. It would be good to 
understand the specific cases that are falling in the gap here.


Domain names are simply strings, but the protocol enforces normalization 
to NFC. In some situations, it might be possible for a browser, for 
example, to have access to the user-provided string, but I can see any 
number of situations where the actual string (as stored in the DNS) 
would need to be displayed.


For the scenario, it does not matter whether it's NFC or NFD, what 
matters is that some particular un-normalized state would be lost; and 
therefore it would be bad if the result is that the string can no longer 
be rendered correctly.


In particular, as the strings in question would be identifiers, where 
accurate recognition is prime.


A./

*From:*Unicode  *On Behalf Of *Asmus 
Freytag via Unicode

*Sent:* 07 August 2019 14:19
*To:* unicode@unicode.org
*Subject:* Re: What is the time frame for USE shapers to provide 
support for CV+C ?


What about text that must exist normalized for other purposes?

Domain names must be normalized to NFC, for example. Will such strings 
display correctly if passed to USE?


A./

On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:

That's correct, the Microsoft implementation of USE spec does not normalize 
as part of the shaping process.

Why? Because the ccc system for non-Latin scripts is not a good mechanism 
for handling complex requirements for these writing systems and the effects of 
ccc-based normalization can disrupt authors intent. Unfortunately, because we 
cannot fix ccc values, shaping engines at Microsoft have ignored them. 
Therefore, recommendation for passing text to USE is to not normalize.

By the way, at the current time, I do not have a final consensus from Tai 
Tham experts and community on the changes required to support Tai Tham in USE. 
Therefore, I've not been able to make the changes proposed in this thread.

Cheers,

Andrew

-Original Message-

From: Richard Wordingham  <mailto:richard.wording...@ntlworld.com>  


Sent: 07 August 2019 13:29

To: Richard Wordingham via Unicode  
<mailto:unicode@unicode.org>

Cc: Andrew Glass  
<mailto:andrew.gl...@microsoft.com>

Subject: Re: What is the time frame for USE shapers to provide support for 
CV+C ?

On Tue, 14 May 2019 03:08:04 +0100

Richard Wordingham via Unicode  
<mailto:unicode@unicode.org>  wrote:

On Tue, 14 May 2019 00:58:07 +

Andrew Glass via Unicode  
<mailto:unicode@unicode.org>  wrote:

Here is the essence of the initial changes needed to support CV+C.

Open to feedback.

   *   Create new SAKOT class

SAKOT (Sk) based on UISC = Invisible_Stacker

   *   Reduced HALANT class

Now only HALANT (H) based on UISC = Virama

   *   Updated Standard cluster mode

[< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB

[VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)*

(VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk

B)* (FAbv)* (FBlw)* (FPst)* [FM]

This next question does not, I believe, affect HarfBuzz.  Will NFC

code render as well as unnormalised code?  In the first example above,

 normalises to , which

does not match any portion of the regular expression.

Could someone answer this question, please?  The USE documentation ("CGJ 
handling will need to be updated if USE is modified to support

normalization") still implies that the USE does not respect canonical 
equivalence.

Richard.







Re: New website

2019-07-22 Thread Asmus Freytag via Unicode

  
  
On 7/22/2019 10:00 AM, Ken Whistler via
  Unicode wrote:

Your
  helpful suggestions will be passed along to the people working on
  the new site.
  
  
  In the meantime, please note that the link to the "Unicode
  Technical Site" has been added to the left column of quick links
  in the page bottom banner, so it is easily available now from any
  page on the new site.
  

(If you ever get to anything other than the "vanity" characters -
  not a given for some devices).
(Also, the "Projects" need to be their own item on the "left",
  not hidden in "basics").

A./


  
  --Ken
  
  
  On 7/22/2019 9:54 AM, Zachary Carpenter wrote:
  
  It seems that many of the concerns
expressed here could be resolved with a menu link to the
“Unicode Technical Site” on the left-hand menu bar

  
  



  



Re: Displaying Lines of Text as Line-Broken by a Human

2019-07-21 Thread Asmus Freytag via Unicode

  
  
There's really no inherent need for
  many spacing combining marks to have a base character. At least
  the ones that do not reorder and that don't overhang the base
  character's glyph.



As far as I can  tell, it's largely a
  convention that originally helped identify clusters and other lack
  of break opportunities. But now that we have separate properties
  for segmentation, it's not strictly necessary to overload the
  combining property for that purpose.


In you example, why do you need the ZWJ
  and dotted circle?


Originally, just applying a combining
  mark to a NBSP should normally show the mark by itself. If a font
  insists on inserting a dotted circle glyph, that's not required
  from a conformance perspective - just something that's seen as
  helpful (to most users).


A./
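
A minimal sketch (illustration only, not from the original message) of the
convention described above: apply the combining mark to NO-BREAK SPACE to show
it in isolation; any dotted circle is supplied by the font, not encoded.

NBSP = "\u00A0"            # NO-BREAK SPACE as the (invisible) base
DOTTED_CIRCLE = "\u25CC"   # explicitly encoded placeholder base, if that is wanted
ACUTE = "\u0301"           # COMBINING ACUTE ACCENT, standing in for any mark

isolated = NBSP + ACUTE            # renderer should show just the mark
explicit = DOTTED_CIRCLE + ACUTE   # renderer should show a dotted circle plus mark

print(isolated.encode("unicode_escape").decode())
print(explicit.encode("unicode_escape").decode())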



On 7/21/2019 4:03 PM, Richard
  Wordingham via Unicode wrote:


  I've been transcribing some Pali text written on palm leaf in the
Tai Tham script.  I'm looking for a way of reflecting the line
boundaries in a manuscript in a transcription.  The problem is that
lines sometimes start or end with an isolated spacing mark.  I want
my text to be searchable and therefore encoded in Unicode.  (I
appreciate that there is a trade-off between searchability and showing
line boundaries.  The unorthodox spelling is also a problem.)

How unreasonable is it for a font to render



as just the spacing mark?  Some rendering systems give the font no way
of distinguishing dotted circles in the backing store from dotted
circles added by the renderer, so this technique is not Unicode
compliant.

An alternative solution is to have a parallel font (or, more neatly, a
feature) that renders some base character (or sequence) as a zero-width
non-inking character.  This, however, would violate that character's
identity.  I suspect there is no Unicode-compliant solution.

Richard.





  



Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Asmus Freytag via Unicode

  
  
On 7/17/2019 6:03 PM, Richard
  Wordingham via Unicode wrote:


  On Thu, 18 Jul 2019 01:54:52 +0200
Philippe Verdy via Unicode  wrote:


  
In fact the ligatures system for the "cursive" Egyptian Hieratic is so
complex (and may also have its own variants showing its progression
from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic
should no longer be considered "unified" with Hieroglyphs, and its
existing ISO 15924 code is then not represented at all in Unicode.

  
  
Writing hieroglyphic text as plain text has only been supported since
Unicode 12.0, so it may take a little while to explore workable encoding
conventions.

A significant issue is that the hieratic script is right to left but
Unicode only standardises the encoding of left-to-right
transcriptions.  I don't recall the difference between retrograde v.
normal text being declared a style difference.

Use directional overrides. Those have been in the standard
  forever. 

A./
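
A minimal sketch (illustration only, not part of the message) of the
directional-override suggestion, using the long-standing RLO/PDF controls:

RLO = "\u202E"   # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"   # POP DIRECTIONAL FORMATTING

def force_rtl(run: str) -> str:
    """Force a run to display right to left, e.g. for retrograde transcriptions."""
    return RLO + run + PDF

print(repr(force_rtl("ABC")))   # a bidi-aware renderer displays this run as CBA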


  

For comparison, we still have no guidance on how to encode sexagesimal
Mesopotamian cuneiform numbers, e.g. '610' v. '20' written using the U
graphic element.

Richard.





  



Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote:


“Transliteration”?

Maybe more generic than what you’re looking for. Used for the process 
of producing the “machine readable zone” on passports:


https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see 
section 6, page 30)


“Accent folding” or “diacritic folding” is used in some places. String 
folding is “A string transform F, with the property that repeated 
applications of the same function F produce the same output: F(F(S)) = 
F(S) for all input strings S”. Accent folding is a special case of that.


https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions

https://alistapart.com/article/accent-folding-for-auto-complete/

Diacritic folding. Thanks. Just didn't think of the operation as folding 
the way it came up, but that's what it is.


A./
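
An illustrative sketch (not part of the original exchange) of the folding just
named, using only Python's standard unicodedata module: decompose, drop
nonspacing marks, recompose. Repeated application gives the same result,
i.e. F(F(S)) == F(S), as required for a string folding.

import unicodedata

def fold_diacritics(s: str) -> str:
    """NFD-decompose, drop nonspacing marks (gc=Mn), and recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

s = "São Tomé"
print(fold_diacritics(s))                                          # Sao Tome
print(fold_diacritics(fold_diacritics(s)) == fold_diacritics(s))   # True

# One-to-many conventions such as Å -> AA or Ä -> AE (raised later in this
# thread) are language-specific and need an explicit mapping table; they do
# not fall out of mark stripping.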


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Wednesday, July 17, 2019 13:38
*To:* Unicode Mailing List
*Subject:* Removing accents and diacritics from a word

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São 
Tomé to Sao Tome?


The linguistic term "string normalization" appears not that preferable 
in a computing context.


Any ideas?

A./







Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:37 AM, Tex wrote:


Asmus, are you including the case where an accented character maps to 
two unaccented characters?


e.g. Å to AA or Ä to AE

If that's covered by the same term; but it's not a simple 
"typewriter/telegraph" fallback.





*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag (c) via Unicode

*Sent:* Wednesday, July 17, 2019 11:07 AM
*To:* Norbert Lindenberg
*Cc:* Unicode Mailing List
*Subject:* Re: Removing accents and diacritics from a word

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?

Not helpful. Anybody have a serious suggestion?

A./

On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode  
<mailto:unicode@unicode.org>  wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São Tomé to 
Sao Tome?

The linguistic term "string normalization" appears not that preferable 
in a computing context.

Any ideas?

A./





Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?


Not helpful. Anybody have a serious suggestion?

A./





On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode  
wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing accents and 
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?

The linguistic term "string normalization" appears not that preferable in a 
computing context.

Any ideas?

A./








Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag via Unicode

  
  
A question has come up in another
context:
  
Is there any linguistic term for
describing the process of removing accents and diacritics from a
word to create its “base form”, e.g. São Tomé to Sao Tome?
The linguistic term "string normalization" appears not that
  preferable in a computing context.
Any ideas?

A./
  

  

  

  



Re: Proposal to extend the U+1F4A9 Symbol

2019-05-31 Thread Asmus Freytag via Unicode

  
  
On 5/31/2019 7:12 AM, Michael Everson
  via Unicode wrote:


  No, thank you.


Not so fast. I think we need to hear from the telemedicine
  community first.
A./


  

  
On 31 May 2019, at 11:18, bristol_poo via Unicode  wrote:

Greetings,

I hope I don't intrude too much on this list with a proposal.

U+1F4A9, aka the 'pile of poo' emoji, has gained somewhat of a legendary status in the modern society [1]. 

With the somewhat recent addition of skin tones in the Emoji Modifier Sequences, I think there is some small room to add more depth to the emoji by modulating it via the Bristol Scale [2].

This would produce 7 variants of the U+1F4A9 emoji, including existing (Which I believe is about Type 4 on the scale). 

Why? I think this would really benefit the medical profession, with a large uptick in e-doctor/text only conversations towards the medical profession. 

Cheers
/BP

[1] We even have plush toys dedicated to this emoji https://www.amazon.co.uk/Emoji-Shape-Pillow-Cushion-Stuffed/dp/B00VL55Q8O
[2] https://en.wikipedia.org/wiki/Bristol_stool_scale

  
  






  



Re: unicode tweet

2019-05-30 Thread Asmus Freytag via Unicode

  
  
On 5/30/2019 1:07 AM, Andre Schappo via
  Unicode wrote:


  
  
  
  This tweet made me laugh twitter.com/padolsey/status/1133835770773626881 勞
  
  
  André Schappo
  

  


  



Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Asmus Freytag via Unicode

  
  
On 5/15/2019 4:22 AM, Costello, Roger
  L. via Unicode wrote:


  Hello Unicode experts!

Which is correct:

(a) The input file contains a string. The string is encoded using UTF-8.

(b) The input file contains a string. The string is encoded with UTF-8.

(c) The input file contains a string. The string is encoded in UTF-8.

(d) Something else (what?)

/Roger




I would say I've seen all three uses about
equally.
If you search for each phrase, though, "in"
comes up as the most frequent one.
That would make the last one, or simply "in
UTF-8" (that is, without the "encoded") good choices for general
audiences.
A./
  


  



Re: Symbols of colors used in Portugal for transport

2019-05-02 Thread Asmus Freytag via Unicode

  
  
On 5/2/2019 8:44 AM, J Andrew Lipscomb
  via Unicode wrote:


  Why not just use U+25E4 and U+25E2 for the triangles, and U+2215 for the diagonal?




Why not wait for evidence of that scheme
being used in text. Then we know.
A./

  



Re: Emoji boom?

2019-05-01 Thread Asmus Freytag via Unicode

  
  
On 5/1/2019 3:23 AM, Shriramana Sharma
  via Unicode wrote:


  http://www.unicode.org/L2/L-curdoc.htm

The number of emoji-related proposals seems to be increasing compared
to the number of script-related ones.

Have we reached a plateau re scripts encoding?

Somehow this seems sad to me considering the great role Unicode played
in bringing Indic scripts (from my POV as an Indian) to mainstream
digital devices.

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा ူ၆ိျိါအူိ၆ါး




On some level, this is an inevitable
transition.
Once all characters in current and historic
use have been encoded, the only remaining proposals would be for
characters that newly enter circulation.
The only unexpected thing is that the "new"
characters are not limited to orthographic reform and new
currency signs, but happen to be emoji.
Blame the fact that those very same "digital
devices" have changed the way people communicate.
:)
A./

  



Re: Fw: Latin Script Danda

2019-04-19 Thread Asmus Freytag via Unicode

  
  
On 4/19/2019 6:57 PM, Shriramana Sharma
  via Unicode wrote:


  
  
I don't know many modern fonts that display 007C
  as a broken glyph. In fact I haven't seen a broken line pipe
  glyph since the MS-DOS days. Nowadays we have 00A6 for that.


  

Same here. In fact, couldn't find any
example among installed fonts on a Windows 7 (not  even Windows
10) system before running out of patience. That seems to
indicate that the disunification of vertical bar and broken bar
was complete 10 years ago.
  
A./
  
  



Re: Emoji Haggadah

2019-04-16 Thread Asmus Freytag via Unicode

  
  


 I
  suspect that this work would be jibber-jabber to any non-English
  speaker unfamiliar with the original Haggadah.  No matter how
  otherwise fluent they might be in emoji communication.

You can't escape fundamental theses:


  There is a well-known thesis in linguistics that every script has to be 
at least in part phonetic, and the above are examples that add support 
to this. For deeper explanations (unfortunately not yet including 
emoji), see e.g. "Visible Speech - The Diverse Oneness of Writing 
Systems", by John DeFrancis, University of Hawaii Press, 1989. 

  
Going further: emoji are also subject to being "conventionalized",
  if that is the term: conventions come about so that some image stands
  for a concept even if that image isn't directly connected.
Some examples are telephone handsets and other early form of
  technology standing in for later versions of the same thing.
  (Floppy disk icon for "save").
More of that will happen with the full spectrum of emoji and
  these conventions may then also no longer be universal but
  specific to some group of users.
At which point, you are back at where the other pictographic
  writing systems started to evolve.
A./

  



Re: USE Indic Syllabic Category

2019-02-22 Thread Asmus Freytag via Unicode

  
  
On 2/22/2019 7:29 AM, Richard
  Wordingham via Unicode wrote:


  On Fri, 22 Feb 2019 09:07:06 +
Richard Wordingham via Unicode  wrote:


  
My best hypothesis (not thoroughly tested) is that Windows currently
has InSc=Consonant_Killer, but can I look his up as opposed to
effectively devising a test suite for USE on Office?

  
  
That question's rather mangled.  It should have said:

My best hypothesis (not thoroughly tested) is that Windows currently
has InSc=Consonant_Killer, but can where I look this up as opposed to
effectively devising a test suite for USE on Windows?

FWIW, HarfBuzz currently has VAbv 'vowel above', in accordance with the
Unicode 11.0 properties.

Richard.



"can where I"  is perhaps not as much an
improvement  :)
A./

  



Re: Encoding colour (from Re: Encoding italic)

2019-02-13 Thread Asmus Freytag via Unicode

  
  
On 2/13/2019 5:19 PM, Mark E. Shoulson
  via Unicode wrote:

 And
  again, all this is before we even consider other issues; I can't
  shake the feeling that there are security nightmares lurking inside
  this idea.

Default ignorables are bad juju.
A./

  



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 1:40 PM, Egmont Koblinger wrote:

On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode
 wrote:


I hope though that all the scripts can be supported with more or less
compromises, e.g. like it would appear in a crossword. But maybe not.

See other messages: not.

For the crossword analogy, I can see why it's not good. But this
doesn't mean there aren't any other ideas we could experiment with.



"all...scripts" is the issue.  We know how to handle text for all 
scripts and what complexities one has to account for in order to do 
that. You can back off some corner cases or (slightly) degrade things, 
but even after you are done with that, there will be scripts where the 
"more or less compromises" forces by the design parameters you gave will 
mean an utterly unacceptable display.


That said, there are scripts that had "passable" typewriter 
implementations and it may be possible to tweak things to approach that 
level support. Don't know for sure, it depends on the details for each 
script.





Or do you mean to say that because it can't be made perfect, there's
no point at all in partially improving? I don't think I agree with
that.



It's more a question of being upfront with your goal.

At this point I understand it as accepting some design parameters as 
fundamental and seeing whether there are some tweaks that allow more 
scripts to work with or to "survive" given the constraints.


That's not a totally useless effort, but it is a far cry from Unicode's 
universal support for ALL writing systems.


A./

PS: also we have been seriously hijacking a thread related to bidi




e.





Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 11:48 AM, Egmont Koblinger wrote:

Hi Asmus,


On quick reading this appears to be a strong argument why such emulators will
never be able to be used for certain scripts. Effectively, the model described 
works
well with any scripts where characters are laid out (or can be laid out) in 
fixed
width cells that are linearly adjacent.

I'm wondering if you happen to know:

Are there any (non-CJK) scripts for which a mechanical typewriter does
not exist due to the complexity of the script?


Egmont,

are you excluding CJK because of the difficulty handling a large
repertoire with mechanical means? However, see:

https://en.wikipedia.org/wiki/Chinese_typewriter




Are there any (non-CJK) scripts for which crossword puzzles don't exist?

For scripts where these do exist, is it perhaps an acceptable tradeoff
to keep their limitations in the terminal emulator world as well, to
combine the terminal emulator's power with these scripts?



I agree with you that crossword puzzles and scrabble have a similar
limitation to the design that you sketched for us. However, take a script
that is written in syllables (each composed of 1-5 characters, say).

In a "crossword" I could write this script so that each syllable occupies
a cell. It would be possible to read such a puzzle, but trying to use 
such a draconian
technique for running text would be painful, to say the least. (We are 
not even

talking about pretty, here).

Here's an example for Hindi:
https://vargapaheli.blogspot.com/2017/
I don't read Hindi, but 5 vertical in the top puzzle, cell 2, looks like 
it contains

both a consonant and a vowel.

To force Hindi crosswords mode you need to segment the string into 
syllables,
each having a variable number of characters, and then assign a single 
display

position to them. Now some syllables are wider than others, so you could use
the single/double width paradigm. The result may be somewhat legible for
Devanagari, but even some of the closely related scripts may not fit 
that well.
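
An illustrative sketch (not from the original message) of the segmentation
step just described, using extended grapheme clusters via the third-party
regex module as a rough stand-in for syllables. Note that default grapheme
clusters do not capture conjuncts, so this only approximates a real
one-syllable-per-cell layout.

import regex   # third-party module; its \X matches extended grapheme clusters

text = "\u0915\u093F"               # DEVANAGARI LETTER KA + VOWEL SIGN I
cells = regex.findall(r"\X", text)  # candidate display cells
print(cells, len(cells))            # ['कि'] 1 -- two code points, one cell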


Now there are some scripts where the same syllable can be written in more
than one form; the forms differing by how the elements are fused (or sometimes
not fused) into a single shape. Sometimes, these differences are more
"stylistic", more like an 'fi' ligature in English; sometimes they really
indicate different words, or one of the forms is simply not correct (like
trying to spell lam-alif in Arabic using two separate letters).

I'm sure there are scripts that work rather poorly (effectively not at all)
in crossword mode. The question then becomes one of goals.

Are you defining as your goal to have some kind of "line by line" display that
can survive any Unicode text thrown at it, or are you trying to extend a given
design with rather specific limitations, so that it survives / can be used
with just a few more scripts than European + CJK?



Honestly, even with English, all I have to do is "cat some_text_file",
and chances are that a word is split in half at some random place
where it hits the right margin. Even with just English, a terminal
emulator isn't something that gives me a grammatically and
typographically super pleasing or correct environment. It gives me
something that I personally find grammatically and typographically
"good enough", and in the mean time a powerful tool to get my work
done.



The discrepancies would be more like throwing random blank spaces in the
middle of every word, writing letters out of order, or overprinting. So, more
fundamental, not just "not perfect".

To give you an idea, here is an Arabic crossword. It uses the isolated shape
of all letters and writes all words unconnected. That's two things that may be
acceptable for a puzzle, but not for text output.

http://www.everyday-arabic.com/2013/12/crossword1.html

(try typing 3 vertical as a word to see the difference - it's 4x U+062A)



Obviously the more complex the script, the more tradeoffs there will
be. I think it's a call each user has to make whether they prefer a
terminal emulator or a graphical app for a certain kind of task. And
if terminal emulators have a lower usage rate in these scripts, that's
not necessarily a problem. If we can improve by small incremental
changes, sure, let's do. If we'd need to heavily redesign plenty of
fundamentals in order to improve, it most likely won't happen.

You may begin to see the limitations and that they may well prevent you from
reaching even your limited goal for speakers of at least three of the top ten
languages worldwide.

A./



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag via Unicode

  
  
On 2/9/2019 12:07 PM, Egmont Koblinger
  via Unicode wrote:


  On Sat, Feb 9, 2019 at 9:01 PM Eli Zaretskii  wrote:


  
then what you say is that some scripts
can never be supported by text terminals.

  
  
I'm not familiar at all with all the scripts and their requirements,
but yes, basically this is what I'm saying. I'm afraid some scripts
can never be perfectly supported by text terminals.



This includes the scripts used for up to four of the world's top
  ten languages.
And it's more than "not perfect"; effectively some scripts cannot
  be shoehorned
  into the fundamental design.
That design was created to work with European scripts, and proved
  somewhat
  adaptable to other scripts that lend themselves to fixed-width
  cell display. But
  beyond that is where you hit the proverbial brick wall.


  

I hope though that all the scripts can be supported with more or less
compromises, e.g. like it would appear in a crossword. But maybe not.



See other messages: not.



  

Maybe one day some new, modern platform will arise with the goal of
replacing terminal emulators, which I wouldn't necessarily mind. It's
gonna take an enormous amount of work, though.


A./



  



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag via Unicode

  
  
On quick reading this appears to be a
  strong argument why such emulators will
never be able to be used for certain
  scripts. Effectively, the model described works
well with any scripts where characters
  are laid out (or can be laid out) in fixed
width cells that are linearly adjacent.


There are some crude techniques that
  allow an extension to cover scripts that
require half-width or double-width
  cells, and perhaps even zero-width.


However, scripts, where rendering
  involves complicated ligatures or other
  typographical interactions that often are specific to a given
  font, would simply 

be out of scope because for those
  scripts the fixed width model with an 

underlying buffer mimicking the display
  simply cannot be made to work.


And indeed, by up-front accepting the
  limitation of a particular design approach
it would be surprising if such
  emulators proved flexible enough to handle the
rather wide variety of writing systems
  supported by Unicode.


At best, the discussion could yield a
  few further approximations of correct
rendering that can be retrofitted to
  the particular design restrictions outlined
below, but that with luck extend the
  envelope somewhat so that a few more
writing systems can be shoehorned into
  it.


However, it appears quite hopeless to
  attempt to cover all of Unicode's scripts
on that premise.


A./









On 2/9/2019 10:25 AM, Egmont Koblinger
  via Unicode wrote:


  On Sat, Feb 9, 2019 at 7:07 PM Eli Zaretskii  wrote:


  
You need to use what HarfBuzz tells you _instead_ of wcswidth.  It is
in general wrong to use wcswidth or anything similar when you use a
shaping engine and support complex script shaping.

  
  
This approach is not viable at all.

Terminal emulators have an internal data structure that they maintain,
a matrix of character cells. Every operation is performed here, every
escape sequence is defined on this layer what it does, the cursor
position is tracked on this layer, etc. You can move the cursor to
integer coordinates, overwrite the letter in that cell, and do plenty
of other operations (like push the rest to the right by one cell). If
you change these fundamentals, most of the terminal-based applications
will fall apart big time.

This behavior has to be absolutely independent from the font. The
application running inside the terminal doesn't and cannot know what
font you use, let alone how harfbuzz is about to render it. (You can
even have no font at all, such as with the libvterm headless emulator
library, or a detached screen or tmux session; or have multiple fonts
at the same time if a screen or tmux session is attached from multiple
graphical emulators.)

So one part of a terminal emulator's code is responsible for
maintaining this matrix of characters according to the input it
receives. Another part of their code is responsible for presenting
this matrix of characters on the UI, doing the best it can.

If you say that the font should determine the logical width, you need
to start building up something brand new from scratch. You need to
have something that doesn't have concepts like "width in characters".
You need to redefine cursor movement and many other escape sequences.
You need to heavily adjust the behavior of a gazillion of software,
e.g. zip's two-column output, anything that aligns in columns (e.g.
midnight commander, tmux's vertical split etc.), the shell's (or
readline's) command editing and wrapping to multiple lines, ncurses,
and so on, all the way to e.g. fullscreen text editors like Emacs.

And then we're not talking about terminal emulators anymore, as we
know them now, but something new, something pretty different.

Terminal emulators do have strong limitations. Complex text rendering
can only work to the extent we can squeeze it into these limitations.


cheers,
egmont
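
An illustrative sketch (not part of the message) of the font-independent
width bookkeeping described above, assuming the third-party wcwidth package
(any wcwidth() implementation would do): each character is assigned 0, 1 or 2
cells regardless of the font used for display.

from wcwidth import wcwidth, wcswidth

for ch in ("A", "\u4E2D", "\u0301"):       # Latin letter, CJK ideograph, combining mark
    print(f"U+{ord(ch):04X}", wcwidth(ch)) # 1, 2 and 0 cells respectively

print(wcswidth("\u4E2D" + "A"))            # 3 -- cells occupied by the whole string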





  



Re: Encoding italic

2019-02-08 Thread Asmus Freytag via Unicode

  
  
On 2/8/2019 5:42 PM, James Kass via
  Unicode wrote:


  
  William,
  
  
  Rather than having the user insert the VS14 after every character,
  the editor might allow the user to select a span of text for
  italicization.  Then it would be up to the editor/app to insert
  the VS14s where appropriate.
  
  
  For Andrew’s example of “fête”, the user would either type the
  string:
  
  “f” + “ê” + “t” + “e”
  
  or the string:
  
  “f” + “e” + <combining circumflex> + “t” + “e”.
  
  
  If the latter, the application would insert VS14 characters after
  the “f”, “e”, “t”, and “e”.  The application would not insert a
  VS14 after the combining circumflex — because the specification
  does not allow VS characters after combining marks, they may only
  be used on base characters.
  
  
  In the first ‘spelling’, since the specifications forbid VS
  characters after any character which is not a base character (in
  other words, not after any character which has a decomposition,
  such as “ê”) — the application would first need to convert the
  string to the second ‘spelling’, and proceed as above.  This is
  known as converting to NFD.
  
  
  So in order for VS14 to be a viable approach, any application
  would ① need to convert any selected span to NFD, and ② only
  insert VS14 after each base character.  And those are two
  operations which are quite possible, although they do add slightly
  to the programmer’s burden.  I don’t think it’s a “deal-killer”.
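
A minimal sketch (illustration only, not part of the message) of the two steps
just described: ① convert to NFD, ② insert VS14 only after base characters.
It only shows the mechanics of the proposal under debate; Unicode does not
define VS14 to mean "italic".

import unicodedata

VS14 = "\uFE0D"   # VARIATION SELECTOR-14

def add_vs14(text: str) -> str:
    """Insert VS14 after each base character of the NFD form, never after a mark."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.append(ch)
        if not unicodedata.category(ch).startswith("M"):   # skip combining marks
            out.append(VS14)
    return "".join(out)

print([f"{ord(c):04X}" for c in add_vs14("fête")])
# ['0066', 'FE0D', '0065', 'FE0D', '0302', '0074', 'FE0D', '0065', 'FE0D']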
  



You are still making the assumption that selecting a different
  glyph for the base character would automatically lead to the
  selection of a different glyph for the combining mark that
  follows. That's an iffy assumption because "italics" can be
  realized by choosing a separate font (typographically, italics is
  realized as a separate typeface).
There's no such assumption built into the definition of a VS. At
  best, inside the same font, there may be an implied ligature, but
  that does not work if there's an underlying font switch.
Under the implicit assumptions bandied about here, the VS
  approach thus reveals itself as a true rich-text solution (font
  switching) albeit realized with pseudo coding rather than markup,
  markdown or escape sequences.
It's definitely no more "plain text" than HTML source code.

A./


  
  Of course, the user might insert VS14s without application
  assistance.  In which case hopefully the user knows the rules. 
  The worst case scenario is where the user might insert a VS14
  after a non-base character, in which case it should simply be
  ignored by any application.  It should never “break” the display
  or the processing; it simply makes the text for that document
  non-conformant.  (Of course putting a VS14 after “ê” should not
  result in an italicized “ê”.)
  
  
  Cheers,
  
  
  James
  
  
  



  



Re: Encoding italic

2019-02-08 Thread Asmus Freytag via Unicode

  
  
On 2/8/2019 2:08 PM, Richard Wordingham
  via Unicode wrote:


  On Fri, 8 Feb 2019 17:16:09 + (GMT)
"wjgo_10...@btinternet.com via Unicode"  wrote:


  
Andrew West wrote:

  
  

  

  Just reminding you that "The initial character in a variation
sequence  
is never a nonspacing combining mark (gc=Mn) or a canonical
decomposable character" (The Unicode Standard 11.0 §23.4).


  
  

  
Hopefully the issue that Andrew mentions can be resolved in some way.

  
  
This is not a problem.  Instead of writing <ê, VS14>, one just writes
<e, VS14, combining circumflex>.

And introducing yet another convention, which is that
  combining marks inherit the font of the base character.
Remember, italics, even though presented as a boolean attribute
  in most UIs, is in fact typographically a font selection.

A./




  

Richard.






  



Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Asmus Freytag via Unicode

  
  
On 2/4/2019 1:00 PM, Richard Wordingham
  via Unicode wrote:


  To me, 'visual order' means in the dominant order of the script. 

Visual order is a term of art, meaning the characters are ordered
  in memory in the same order as they are displayed on the screen.
Whether that progresses from left to right or right to left would
  then depend on the display algorithm. When screen display
  corresponded to actual buffers in memory, those tended to be
  organized left-to-right, with lowest address at the top left.
The contrasting term is "logical order" which (largely)
  corresponds to the order in which characters are typed or spoken.
Logical order text needs to get rearranged during display
  whenever it does not correspond to visual order.

A./
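
An illustrative toy sketch (not from the message) of the two terms of art: a
right-to-left word stored in logical order versus the same letters in a
visual-order, left-to-right screen buffer. Real rendering uses the full bidi
algorithm, not a simple reverse.

logical = "\u05E9\u05DC\u05D5\u05DD"   # a Hebrew word, first-typed letter first
visual = logical[::-1]                 # leftmost buffer cell holds the last-typed letter

print([f"U+{ord(c):04X}" for c in logical])
print([f"U+{ord(c):04X}" for c in visual])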

  



Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Asmus Freytag via Unicode

  
  
On 2/4/2019 11:21 AM, Costello, Roger
  L. via Unicode wrote:


  Hello Unicode Experts!

As I understand it, endian-ness applies to multi-byte words.

Endian-ness does not apply to ASCII characters because each character is a single byte.

Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF32-LE because each character uses multiple bytes. 

Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in Little Endian would the character é appear in a hex editor as A9 C3 whereas if the file is in Big Endian the character é would appear in a hex editor as C3 A9?

/Roger




UTF-8 is a byte stream. Therefore, the order
of bytes in a multiple byte integer does not come into it.
A./
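
An illustrative sketch (not part of the message) of that point: the UTF-8
bytes for é come in exactly one order, while the 16-bit code units of UTF-16
have two possible serializations.

s = "\u00E9"                        # é
print(s.encode("utf-8").hex())      # c3a9 -- a byte sequence, no BE/LE variants
print(s.encode("utf-16-be").hex())  # 00e9
print(s.encode("utf-16-le").hex())  # e900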

  



Re: Encoding italic

2019-01-31 Thread Asmus Freytag via Unicode

  
  
On 1/31/2019 12:55 AM, Tex via Unicode
  wrote:


  As with the many problems with walls not being effective, you choose to ignore the legitimate issues pointed out on the list with the lack of italic standardization for Chinese braille, text to voice readers, etc.
The choice of plain text isn't always voluntary. And the existing alternatives, like math italic characters, are problematic.

The underlying issue is the lack of rich
text support in places where users expect rich text.
The solution is to find ways to enable rich
text layers that are not full documents and make them
interoperable.
The solution is not to push this into plain
text - which then becomes lowest common denominator rich text
instead.
A./
  
  



Re: Encoding italic

2019-01-30 Thread Asmus Freytag via Unicode

  
  
On 1/30/2019 7:46 PM, David Starner via
  Unicode wrote:


  On Sun, Jan 27, 2019 at 12:04 PM James Kass via Unicode
 wrote:

  
A new beta of BabelPad has been released which enables input, storing,
and display of italics, bold, strikethrough, and underline in plain-text

  
  
Okay? Ed can do that too, along with nano and notepad. It's called
HTML (TeX, Troff). If by plain-text, you mean self-interpeting,
without external standards, then it's simply impossible.



It's either "markdown" or control/tag
sequences. Both are out of band information.
And without external standard, not
interoperable.
A./

  



Re: Encoding italic

2019-01-30 Thread Asmus Freytag via Unicode

  
  
On 1/30/2019 4:38 PM, Kent Karlsson via
  Unicode wrote:


  I did say "multiple" and "for instance". But since you ask:

ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
is implemented in Cygwin (sorry for mentioning a product name).)



No need to be sorry; we understand that the motivation is not so
  much advertising as giving a concrete example. It would be
  interesting if anything out there implements CMY(K). My
  expectation would be that this would be limited to interfaces for
  printers or their emulators.
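
An illustrative sketch (not part of the exchange) of the "m" (SGR) control
sequences under discussion, including the ISO/IEC 8613-6 / ITU T.416
direct-colour form; terminal support varies, and italic (SGR 3) in particular
is optional.

ESC = "\x1b"
print(f"{ESC}[1mbold{ESC}[0m")                         # ECMA-48 SGR 1
print(f"{ESC}[3mitalic{ESC}[0m")                       # ECMA-48 SGR 3
print(f"{ESC}[38;2;120;100;160mmuted colour{ESC}[0m")  # 8613-6 style direct RGB foreground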




  
(The "named" ones, though very popular in terminal emulators, are
all much too stark, I think, and the exact colour for them are
implementation defined.)



Muted colors are something that's become more popular as display
  hardware has improved. Modern displays are able to reproduce these
  both more predictably as well as with the necessary degree of
  contrast (although some users'/designer's fetish for low contrast
  text design is almost as bad as people randomly mixing "stark"
  FG/BG colors in the '90s.)




  

ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
traditionally does not use bold or italic. Compare those specified for CSS
(https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
be of interest for the generalised subject of this thread.



Mapping all of these to CSS would be essential if you want this
  stuff to be interoperable.




  

There are some other differences as well, but those are the major ones
with regard to text styling. (I don't know those standards to a tee.
I've just looked at the "m" control sequences for text styling. And yes,
I looked at the free copies...)

/Kent Karlsson

PS
If people insist that EACH character in "plain text" italic/bold/etc
"controls" be default ignorable: one could just take the control sequences
as specified, but map the printable-characters part to the corresponding
tag characters... Not that I think that is really necessary.

Systems that support "markdown", i.e. simplified markup to
  provide the most main-stream features of rich-text tend to do that
  with printable characters, for a reason. Perhaps two reasons.
Users find it preferable to have a visible fallback when
  "markdown" is not interpreted by a receiving system and users'
  generally like the ability to edit the markdown directly (even if,
  for convenience) there's some direct UI support for adding text
  styling.
Loading up the text with lots of invisible characters that may be
  deleted or copied out of order by someone working on a system that
  neither interprets nor displays these code points is an
  interoperability nightmare in my opinion.




  


Den 2019-01-30 22:24, skrev "Doug Ewell via Unicode" :


  
Kent Karlsson wrote:
 


  Yes, great. But as I've said, we've ALREADY got a
default-ignorable-in-display (if implemented right)
way of doing such things.

And not only do we already have one, but it is also
standardised in multiple standards from different
standards institutions. See for instance "ISO/IEC 8613-6,
Information technology --- Open Document Architecture (ODA)
and Interchange Format: Character content architecture".


 
I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
has the advantage of not costing me USD 179, and it looks very similar
to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
are talking about: setting text display properties such as bold and
italics by means of escape sequences.
 
Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
doing, and if it does not, why we should not simply refer to the more
familiar 6429?
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


  
  






  



Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Asmus Freytag via Unicode

  
  
Arabic terminals and terminal emulators
  existed at the time of Unicode 1.0. If you are trying to emulate
  those services, for example so that older software can run, you
  would need to look at how these programs expected to be fed their
  data.


I see little reason to reinvent things
  here, because we are talking about emulating legacy hardware. Or
  are we not?


It's conceivable, that with modern
  fonts, one can show some characters that could not be supported on
  the actual legacy hardware, because that was limited by available
  character memory and available pre-Unicode character sets. As long
  as the new characters otherwise fit the paradigm (character per
  cell) they can be supported without other changes in the protocol
  beyond change in character set.


However, I would not expect an emulator
  to accept data in NFD for example.


A./





On 1/30/2019 2:02 PM, Richard
  Wordingham via Unicode wrote:


  On Wed, 30 Jan 2019 15:33:38 +0100
Frédéric Grosshans via Unicode  wrote:


  
Le 30/01/2019 à 14:36, Egmont Koblinger via Unicode a écrit :


  - It doesn't do Arabic shaping. In my recommendation I'm arguing
that in this mode, where shuffling the characters is the task of
the text editor and not the terminal, so should it be for Arabic
shaping using presentation form characters.  



I guess Arabic shaping is doable through presentation form
characters, because the latter are character inherited from legacy
standards using them in such solutions.

  
  
So long as you don't care about local variants, e.g. U+0763 ARABIC
LETTER KEHEH WITH THREE DOTS ABOVE.  It has no presentation form
characters.

Basic Arabic shaping, at the level of a typewriter, is straightforward
enough to leave to a terminal emulator, as Eli has suggested.  Lam-alif
would be trickier - one cell or two?


  
But if you want to support
other “arabic like” scripts (like Syriac, N’ko), or even some LTR
complex scripts, like Myanmar or Khmer, this “solution” cannot work,
because no equivalent of “presentation form characters” exists for
these scripts

  
  
I believe combining marks present issues even in implicit modes.  In
implicit mode, one cannot simply delegate the task to normal text
rendering, for one has to allocate text to cells.  There are a number
of complications that spring to mind:

1) Some characters decompose to two characters that may otherwise lay
claim to their own cells:

U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
0654>.  Do you intend that your scheme be usable by Unicode-compliant
processes?

2) 2-part vowels, such as U+0D4A MALAYALAM VOWEL SIGN O, which
canonically decomposes into a preceding combining mark U+0D46 MALAYALAM
VOWEL SIGN E and following combining mark U+0D3E MALAYALAM VOWEL SIGN
AA.

3) Similar 2-part vowels that do not decompose, such as U+17C4 KHMER
VOWEL SIGN OO.  OpenType layout decomposes that into a preceding
'U+17C1 KHMER VOWEL SIGN E' and the second part.

4) Indic conjuncts.
(i) There are some conjuncts, such as Devanagari K.SSA, where a
display as ,  is simply unacceptable.  In some
closely related scripts, this conjunct has the status of a character.

(ii) In some scripts, e.g. Khmer, the virama-equivalent is not an
acceptable alternative to form a consonant stack.  Khmer could
equally well have been encoded with a set of subscript consonants in
the same manner as Tibetan.

(iii) In some scripts, there are marks named as medial consonants
which function in exactly the same way as <'virama', consonant>; it is
silly to render them in entirely different manners.

5) Some non-spacing marks are spacing marks in some contexts.  U+102F
MYANMAR VOWEL SIGN U is probably the best known example.

Richard.
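
An illustrative sketch (not part of the message) checking point (1) of the
list above with Python's standard unicodedata module: the canonical
decomposition means NFC/NFD conversion changes how many code points a
cell-assignment scheme has to place.

import unicodedata

ch = "\u06D3"   # ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
print(unicodedata.decomposition(ch))    # '06D2 0654' (canonical)
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)])   # ['06D2', '0654']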



  






  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 10:08 PM, Richard
  Wordingham via Unicode wrote:


  On Sat, 26 Jan 2019 21:11:36 -0800
Asmus Freytag via Unicode  wrote:


  
On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote:

  
  

  

  That appears to contradict Michael Everson's remark about a
Polynesian
need to distinguish the two visually.


  
  

  
Why do you need to distinguish them? To code text correctly (so the
invisible properties are what the software expects) or because a
human reader needs the disambiguation in order to follow the text?

  
  

  
The latter phenomenon is so common throughout many writing systems,
that I have difficulties buying it.

  
  
It may be a matter of literacy in Hawaiian.  If the test readership
doesn't use ʼokina, it could be confusing to have to resolve the
difference between a sentence(?) starting with one from a sentence in
single quotes. Otherwise, one does wonder why the issue should only
arise now.



one does.




  

One other possibility is that single quote punctuation is being used on
a readership used to double quote punctuation.  Double quotes would
avoid the confusion.

Choice of quotation marks is language-based and for novels, many
times there are
additional conventions that may differ by publisher.
Wonder why the publisher is forcing single quotes on them?




  


  
PS: I wasn't talking about what the Polynesians do; different part of
the world.

  
  
Why should the Polynesians be different?



I am simply stating that my evidence does not come from them. I
  have no special insight into what Polynesians do or do not do.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 7:53 PM, Richard
  Wordingham via Unicode wrote:


  On Sun, 27 Jan 2019 01:55:29 +
James Kass via Unicode  wrote:


  
Richard Wordingham replied to Asmus Freytag,

 >> To make matters worse, users for languages that "should" use
 >> U+02BC aren't actually consistent; much data uses U+2019 or
 >> U+0027. Ordinary users can't tell the difference (and spell
 >> checkers seem not successful in enforcing the practice).  
 >
 > That appears to contradict Michael Everson's remark about a
 > Polynesian need to distinguish the two visually.  

Does it?

U+02BC /should/ be used but ordinary users can't tell the difference 
because the glyphs in their displays are identical, resulting in much 
data which uses U+2019 or U+0027.  I don't see any contradiction.

  
  
I had assumed that Polynesians would be writing with paper and ink.  It
depends on what 'tell the difference' means.  In normal parlance it
means that they are unaware of the difference in the symbols; you are
assuming that it means that printed material doesn't show the
difference.

In general, handwritten differences can show up in various ways.  For
example, one can find a slight, unreliable difference in the relative
positioning of characters that reflects the difference in the usage of
characters.

Of course, Asmus's facts have to be unreliable.  It's like someone
typing U+1142A NEWA LETTER MHA for Sanskrit , which we've been
assured would never happen.  There must be something wrong with reality.



There usually is :)
Our leaders tell us so.
Anyway, most of us don't use U+2019 where proper unless we happen
  to use 
  software that makes the translation from U+0027 for us . . .
When it picks the left single quote by mistake, that's something
  we can spot and nudge it. When the difference is invisible people
  will type the wrong thing - like typesetting whole books with the
  wrong Arabic character because it happens to share the same shape
  in that position with another one.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 6:25 PM, Michael Everson
  via Unicode wrote:


  On 27 Jan 2019, at 01:37, Richard Wordingham via Unicode  wrote:

  



  I’ll be publishing a translation of Alice into Ancient Greek in due
course. I will absolutely only use U+2019 for the apostrophe. It
would be wrong for lots of reasons to use U+02BC for this.



Please list them.

  
  
  The Greek use is of an apostrophe. Often a mark of elision (as here); that’s what 2019 is for.

02BC is a letter. Usually a glottal stop. 

I didn’t follow the beginning of this. Evidently it has something to do with word selection of d’ + a space + what follows. If that’s so, then there’s no argument at all for 02BC. It’s a question of the space, and that’s got nothing to do with the identity of the apostrophe.


  
Will your coding decision be machine readable for the readership?

  
  
I don’t know what you mean by “readable”.


  

  Moreover, implementations of U+02BC need to be revised. In the
context of Polynesian languages, it is impossible to use U+02BC if it
is _identical_ to U+2019. Readers cannot work out what is what. I
will prepare documentation on this in due course.



It looks as though you've found a new character - or a revived
distinction.

  
  
  It may not be “revived”. In origin, linguists took the lead-type 2019 and used it as a consonant letter. Now, in the 21st century, where Harry Potter is translated into Hawaiian, and where Harry Potter has glottals alongside both single and double quotation marks, 

The use of quotation marks is language dependent. There is no
  cast in stone requirement to use single quotation marks with
  languages where it causes difficulties.

English uses apostrophe and single quotation marks - the former
  are a bit more rare compared to when that symbol is used in some
  languages, but in principle the same confusion applies and so far
  hasn't prompted anyone to follow the lead of the French in choice
  of quotation marks . . .


  the 02BC’s need to be bigger or the text can’t be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic.

If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation.

It somehow seems to me that an evolution of the glyph shape of
  02BC in a direction of increased distinction from U+2019 is
  something that Unicode has indeed made possible by a separate
  encoding. However, that evolution is a matter of ALL the language
  communities that use U+02BC as part of their orthography, and
  definitely NOT something where Unicode can be permitted to take a
  lead. Unicode does not *recommend* glyphs for letters.
However, as a publisher, you are of course free to experiment and
  to see whether your style becomes popular.
There is a concern though, that your choice may appeal only to
  some languages that use this code point and not become universally
  accepted.

A./



  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 5:43 PM, Richard
  Wordingham via Unicode wrote:


  On Sat, 26 Jan 2019 17:11:49 -0800
Asmus Freytag via Unicode  wrote:


  
To make matters worse, users for languages that "should" use U+02BC
aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
users can't tell the difference (and spell checkers seem not
successful in enforcing the practice).

  
  
That appears to contradict Michael Everson's remark about a Polynesian
need to distinguish the two visually.

Richard.



Why do you need to distinguish them? To code
text correctly (so the invisible properties are what the
software expects) or because a human reader needs the
disambiguation in order to follow the text?
The former is like first coding a different
character for a decimal point from an ordinary period, then
deciding to make it look different so you know you typed the
right one. The latter is like saying people can't handle using
the same symbol (dot on the baseline) for two different
functions. 
  
The latter phenomenon is so common
throughout many writing systems, that I have difficulties buying
it.
A./
PS: I wasn't talking about what the
Polynesians do; different part of the world.
  


  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 3:02 AM, Mark Davis ☕️ via
  Unicode wrote:


  
  

  > breaking
  selection for "d'Artagnan" or "can't" into two is overly
  fussy.
  
  
  
True, and that is not what U+2019
  does; it does not break medially.
  

  



Not everyone seems to have got the word . . . but that's not
  Unicode's fault. But it shows that picking specific character codes
  from among a set that are identical except for (invisible)
  properties could be a losing game if widely deployed software
  can't be relied on to honor such finesse.

A./
PS: btw, the Root Zone of the DNS will not support U+02BC as a
  "letter". The "invisible" distinction in property is irrelevant
  when it comes to identifiers that are identified visually by users,
  and further, we don't really want to encourage people to use it to
  register words intended to contain apostrophes. Since we can't
  have ordinary apostrophes or U+2019, we can't have U+02BC looking
  like it might be one of the others.
To make matters worse, users for languages that "should" use
  U+02BC aren't actually consistent; much data uses U+2019 or
  U+0027. Ordinary users can't tell the difference (and spell
  checkers seem not successful in enforcing the practice).
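
An illustrative sketch (not part of the message) of the "invisible" property
distinction mentioned in the PS: the three look-alike characters differ in
General_Category, which is what segmentation and identifier rules key off,
not in appearance.

import unicodedata

for cp in (0x0027, 0x2019, 0x02BC):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+0027 APOSTROPHE Po
# U+2019 RIGHT SINGLE QUOTATION MARK Pf
# U+02BC MODIFIER LETTER APOSTROPHE Lm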


  

  


  
  

  

  

  

  

  
  Mark


  

  

  

  

  

  

  
  

  
  
  
    On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag
  via Unicode <unicode@unicode.org> wrote:


  
On
  1/25/2019 9:39 AM, James Tauber via Unicode wrote:


  Thank you, although the word break does
still affect things like double-clicking to select.


And people do seem to want to use U+02BC for this
  reason (and I'm trying to articulate why that isn't
  what U+02BC is meant for).


  

For normal edition operations, breaking selection for
  "d'Artagnan" or "can't" into two is overly fussy.
No wonder people get frustrated.

A./


  
James
  
  
  
On Fri,
  Jan 25, 2019 at 12:34 PM Mark Davis ☕️ <m...@macchiato.com>
  wrote:


  

  
U+2019 is normally
  the character used, except where the ’ is
  considered a letter. When it is between
  letters it doesn't cause a word break, but
  because it is also a right single quote, at
  the end of words there is a break. Thus in a
  phrase like «tryin’ to go» there is a word
  break after the n, because one can't tell.


So something like "δ’
  αρχαια" (picking a phrase at random) would
  have a word break after the delta. 



Word break: 

  

  δ’ αρχαια 

  



However, there is no
  line break between them (which is the
  more important operation in normal usage).
  Probably not worth tailoring the word break.


Line break:

  
  

Re: Encoding italic

2019-01-25 Thread Asmus Freytag (c) via Unicode

On 1/25/2019 3:49 PM, Andrew Cunningham wrote:
Assuming some mechanism for italics is added to Unicode, when 
converting between the new plain text and HTML there is insufficient 
information to correctly convert to HTML. Many elements may have 
italic styling, and there would be no meta information in Unicode to 
indicate the appropriate HTML element.




So, we would be creating an interoperability issue.

A./





On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode 
<unicode@unicode.org> wrote:


    Asmus Freytag wrote;

Other schemes, like a VS per code point, also suffer from
being different in philosophy from "standard" rich text
approaches. Best would be as standard extension to all the
messaging systems (e.g. a common markdown language, supported
by UI).     A./


Yet that claim of what would be best would be stateful and
statefulness is the very thing that Unicode seeks to avoid.

Plain text is the basic system and a Variation Selector mechanism
after each character that is to become italicized is not stateful
and can be implemented using existing OpenType technology.

If an organization chooses to develop and use a rich text format
then that is a matter for that organization and any changing of
formatting of how italics are done when converting between plain
text and rich text is the responsibility of the organization that
introduces its rich text format.

Twitter was just an example that someone introduced along the way,
it was not the original request.

Also this is not only about messaging. Of primary importance is
the conservation of texts in plain text format, for example, where
a printed book has one word italicized in a sentence and the text
is being transcribed into a computer.

William Overington
Friday 25 January 2019



--
Andrew Cunningham
lang.supp...@gmail.com <mailto:lang.supp...@gmail.com>







Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Asmus Freytag via Unicode

  
  
On 1/25/2019 10:05 AM, James Kass via
  Unicode wrote:


  
  For U+2019, there's a note saying 'this is the preferred character
  to use for apostrophe'.
  
  
  Mark Davis wrote,
  
  
  > When it is between letters it doesn't cause a word break, ...
  
  
  Some applications don't seem to get that.  For instance, the
  spellchecker for Mozilla Thunderbird flags the string "aren" for
  correction in the word "aren’t", which suggests that users trying
  to use preferred characters may face uphill battles.
  
  
  



James, by now it's unclear whether your ' is 2019 or 02BC.
Spellcheckers are truly dumb sometimes when "user perceived
  words" don't match what the fussy prescriptionistas ordain.
And then you get parts of perfectly valid "words" rejected, and
  can't even fix them with overrides, because the override doesn't
  accept the whole expression.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Asmus Freytag via Unicode

  
  
On 1/25/2019 9:39 AM, James Tauber via
  Unicode wrote:


  
  Thank you, although the word break does still
affect things like double-clicking to select.


And people do seem to want to use U+02BC for this reason
  (and I'm trying to articulate why that isn't what U+02BC is
  meant for).


  

For normal edition operations, breaking selection for
  "d'Artagnan" or "can't" into two is overly fussy.
No wonder people get frustrated.

A./


  
James
  
  
  
On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️ wrote:

  U+2019 is normally the character used, except where the ’ is
  considered a letter. When it is between letters it doesn't cause a
  word break, but because it is also a right single quote, at the end
  of words there is a break. Thus in a phrase like «tryin’ to go»
  there is a word break after the n, because one can't tell.

  So something like "δ’ αρχαια" (picking a phrase at random) would
  have a word break after the delta.

  Word break:

    δ’ αρχαια

  However, there is no line break between them (which is the more
  important operation in normal usage). Probably not worth tailoring
  the word break.

  Line break:

    δ’ αρχαια

Mark

On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode wrote:

  There seems to be some debate amongst digital classicists about
  whether to use U+2019 or U+02BC to represent the apostrophe in
  Ancient Greek when marking elision (e.g. δ’ for δέ preceding a word
  starting with a vowel).

  It seems to me that U+2019 is the technically correct choice per
  the Unicode Standard, but it is not without at least one problem:
  default word breaking rules.

  I'm trying to provide guidelines for digital classicists in this
  regard.

  Is it correct to say the following:

  1) U+2019 is the correct character to use for the apostrophe in
     Ancient Greek when marking elision.
  2) U+02BC is a misuse of a modifier for this purpose.
  3) However, use of U+2019 (unlike U+02BC) means the default Word
     Boundary Rules in UAX#29 will (incorrectly) exclude the
     apostrophe from the word token.
  4) And use of U+02BC (unlike U+2019) means the Glyph Cluster
     Boundary Rules in UAX#29 will (incorrectly) include the
     apostrophe as part of a glyph cluster with the previous letter.
  5) The correct solution is to tailor the Word Boundary Rules in the
     case of Ancient Greek to treat U+2019 as not breaking a word
     (which 
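
As a rough illustration of the word-token difference described above, a
minimal sketch using Python's standard re and unicodedata modules as a
stand-in for the UAX #29 default word breaking (not an implementation of
it); the sample phrase is the one from this thread:

    import re
    import unicodedata

    # δ’ with U+2019 (RIGHT SINGLE QUOTATION MARK) vs. δʼ with U+02BC
    # (MODIFIER LETTER APOSTROPHE).
    samples = ["\u03B4\u2019 \u03B1\u03C1\u03C7\u03B1\u03B9\u03B1",
               "\u03B4\u02BC \u03B1\u03C1\u03C7\u03B1\u03B9\u03B1"]

    for text in samples:
        apostrophe = text[1]
        # U+2019 is Final_Punctuation (Pf); U+02BC is a Modifier_Letter (Lm),
        # so only the latter counts as a word character here.
        print(unicodedata.name(apostrophe), unicodedata.category(apostrophe))
        print(re.findall(r"\w+", text))   # ['δ', 'αρχαια'] vs. ['δʼ', 'αρχαια']
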

Re: Encoding italic

2019-01-25 Thread Asmus Freytag (c) via Unicode

On 1/25/2019 1:06 AM, wjgo_10...@btinternet.com wrote:

Asmus Freytag wrote:

Other schemes, like a VS per code point, also suffer from being 
different in philosophy from "standard" rich text approaches. Best 
would be as standard extension to all the messaging systems (e.g. a 
common markdown language, supported by UI). A./


Yet that claim of what would be best would be stateful and 
statefulness is the very thing that Unicode seeks to avoid. 


All rich text is stateful, and rich text is very widely used; cut-and-paste 
tends to work rather well among applications that support it, 
as do conversions of entire documents. Trying to duplicate it with "yet 
another mechanism" is a doubtful achievement, even if it could be made 
"stateless".


A./



Re: Encoding italic

2019-01-24 Thread Asmus Freytag (c) via Unicode

On 1/24/2019 11:14 PM, Tex wrote:


I am surprised at the length of this debate, especially since the 
arguments are repetitive…


That said:

Twitter was offered as an example, not the only example just one of 
the most ubiquitous. Many messaging apps and other apps would benefit 
from italics. The argument is not based on adding italics to twitter.


Most apps today have security protections that filter or translate 
problematic characters. If the proposal would cause “normalization” 
problems, adding the proposed characters to the filter lists or 
substitution lists would not be a big burden.


The biggest burden would be to the apps that would benefit, to add 
italicizing and editing capabilities.


The "normalization" issue is that when you import to rich text, you don't want 
competing formatting instructions. Getting styled character codes 
normalized to styling of character runs is the most difficult part; that's 
why the abuse of math italics really is abuse in terms of interoperability.


Other schemes, like a VS per code point, also suffer from being 
different in philosophy from "standard" rich text approaches. Best would 
be as standard extension to all the messaging systems (e.g. a common 
markdown language, supported by UI).


A./


tex

*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Thursday, January 24, 2019 10:34 PM
*To:* unicode@unicode.org
*Subject:* Re: Encoding italic

On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote:

But the root problem isn't the kludge, it's the lack of
functionality in these systems: if Twitter etc. simply implemented
some styling on their own, the whole thing would be a moot point.
Essentially, this is trying to add features to Twitter without
waiting for their development team.

Interoperability is not an issue, since in modern computers
copying and pasting styled text between apps works just fine.

Yep, that's what this is: trying to add features to some platforms 
that could very simply be added by the  respective developers while in 
the process causing a normalization issue (of sorts) everywhere else.


A./





Re: Encoding italic

2019-01-24 Thread Asmus Freytag via Unicode

  
  
On 1/24/2019 9:44 PM, Garth Wallace via
  Unicode wrote:


  But the root problem isn't the kludge, it's the lack of
functionality in these systems: if Twitter etc. simply
implemented some styling on their own, the whole thing would be
a moot point. Essentially, this is trying to add features to
Twitter without waiting for their development team.

  
  Interoperability is not an issue, since in modern computers
copying and pasting styled text between apps works just fine.  

Yep, that's what this is: trying to add
features to some platforms that could very simply be added by
the  respective developers while in the process causing a
normalization issue (of sorts) everywhere else. 
  
A./

  



Re: Encoding italic (was: A last missing link)

2019-01-20 Thread Asmus Freytag via Unicode

  
  
On 1/20/2019 2:55 PM, James Kass via
  Unicode wrote:


  
  On 2019-01-20 10:49 PM, Garth Wallace wrote:
  
  I think the real solution is for Twitter
to just implement basic styling and make this a moot point.

  
  
  At which time it would only become a moot point for Twitter
  users.  There's also Facebook and other on-line groups.  Plus
  scholars and linguists.  And interoperability.
  
  
  

Interoperability exists when multiple
parties support the same standard.
The fallacy is the assumption that because
Unicode is so widely supported, it is the best standard to
codify such interoperability.
It overlooks the fact that each new feature
(for it to work as intended, not just as fallback) needs to be
supported by everyone. 
  
For the use case (enable styling for "chat"
or messenging services and relatives) a standard that defines
how to handle a subset of basic styling among "consenting"
platforms is the obvious answer.
The starting set for such styling should be
"character" styles that are applicable inside a "paragraph".
A./

  



Re: Encoding italic (was: A last missing link)

2019-01-20 Thread Asmus Freytag via Unicode

  
  
On 1/20/2019 2:49 PM, Garth Wallace via
  Unicode wrote:


  
  
I think the real solution is for Twitter to just
  implement basic styling and make this a moot point.
  



Twitter FB and CO should implement a common "MarkDown" scheme or
  some other common formatting subset. 

A./


  

  On Sun, Jan 20, 2019 at 2:37 AM Andrew West via
Unicode  wrote:
  
  On Sun, 20
Jan 2019 at 03:16, James Kass via Unicode
 wrote:
>
> Possible approaches include:
>
> 3 - Open/Close punctuation treatment
> Stateful.  Works on ranges.  Not currently supported in
plain-text.
> Could be supported in applications which can take a
text string URL and
> make it a clickable link.  Default appearance in
nonsupporting apps may
> resemble existing plain-text italic kludges such as
slashes.  The ASCII
> is already in the character string.

A possibility that I don't think has been mentioned so far
would be to
use the existing tag characters (E0020..E007F). These are no
longer
deprecated, and as they are used in emoji flag tag
sequences, software
already needs to support them, and they should just be
ignored by
software that does not support them. The advantages are that
no new
characters need to be encoded, and they are flexible so that
tag
sequences for start/end of italic, bold, fraktur,
double-struck,
script, sans-serif styles could be defined. For example
start and end
of italic styling could be defined as the tag sequences
<i> and </i>
(E003C E0069 E003E and E003C E002F E0069 E003E).

Andrew
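
As a sketch of what such (hypothetical, not yet sanctioned) sequences would
look like: the tag block simply mirrors ASCII at an offset of U+E0000, so the
sequences listed above can be derived mechanically:

    def to_tag_sequence(ascii_markup: str) -> str:
        """Map printable ASCII to the corresponding tag characters (U+E0000 + code)."""
        return "".join(chr(0xE0000 + ord(c)) for c in ascii_markup)

    # Hypothetical start/end-of-italic tags from the message above.
    start_italic = to_tag_sequence("<i>")   # E003C E0069 E003E
    end_italic = to_tag_sequence("</i>")    # E003C E002F E0069 E003E

    tagged = start_italic + "italic text" + end_italic
    print([f"{ord(c):04X}" for c in tagged])
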
  

  



  



Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

  
  
On 1/19/2019 3:53 AM, James Kass via
  Unicode wrote:


  
  Marcel Schneider wrote,
  
  
  > When you ask for knowing the foundations and that knowledge
  is persistently refused,
  
  > you end up believing that those foundations just can’t be
  told.
  
  >
  
  > Note, too, that I readily ceased blaming UTC, and shifted the
  blame elsewhere, where it
  
  > actually belongs to.
  
  
  Why not think of it as a learning curve?  Early concepts and
  priorities were made from a lower position on that curve.  We can
  learn from the past and apply those lessons to the future, but a
  post-mortem seldom benefits the cadaver.
  



+1. Well put about the cadaver.


  
  Minutiae about decisions made long ago probably exist, but may be
  presently poorly indexed/organized and difficult to search/access.
  As the collection of encoding history becomes more sophisticated
  and the searching technology becomes more civilized, it may become
  easier to glean information from the archives.
  
  
  (OT - A little humor, perhaps...
  
  On the topic of Francophobia, it is true that some of us do not
  like dead generalissimos.  But most of us adore the French for
  reasons beyond Brigitte Bardot and bon-bons.  Cuisine, fries, dip,
  toast, curls, culture, kissing, and tarts, for instance.  Not to
  mention cognac and champagne!)
  
  
  

It is time for this discussion to be moved to a small group of people
interested in hashing out actual proposals for submission. Is there
anyone here who would like to collaborate with Marcel to find a
solution for European number formatting that

(1) fully supports the typographic best practice
(2) identifies acceptable fallbacks
(3) is compatible with existing legacy practice, even if that does not
    conform to (1) or (2)
(4) includes necessary adjustments to CLDR

If nobody here is interested in working on that, discussing this
further on this list will not serve a useful purpose, as nothing will
change in Unicode without a well-formulated proposal that covers the
four parameters laid out here.

A./
  
  



Re: Encoding italic

2019-01-19 Thread Asmus Freytag via Unicode

  
  
On 1/19/2019 12:34 PM, James Kass via
  Unicode wrote:


  
  On 2019-01-19 6:19 PM, wjgo_10...@btinternet.com wrote:
  
  
  > It seems to me that it would be useful to have some codes
  that are
  
  > ordinary characters in some contexts yet are control codes in
  others, ...
  
  
  Italics aren't a novel concept.  The approach for encoding new
  characters is that  conventions for them exist and that people
  *are* exchanging them, people have exchanged them in the past, or
  that people demonstrably *need* to exchange them.
  
  
  Excluding emoji, any suggestion or proposal whose premise is "It
  seems to me that  it would be useful if characters supporting
  ..." is doomed to be deemed out of scope for
  the standard.
  
  
  

+1. It's the worst kind of "leading
standardization".


  



Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 11:34 PM, Marcel Schneider
  via Unicode wrote:


  
Current
  practice in electronic publishing was to use a non-breakable
  thin space, Philippe Verdy reports. Did that information come
  in somehow?

==> probably not in the early days. Y

  
  Perhaps it was ignored from the beginning on, like Philippe Verdy
  reports that UTC ignored later demands, getting users upset. 

==> for reasons given in another post, I tend to not give much
  credit to these suggestions. 

For one, many worthwhile additions / changes to Unicode depend on
  getting written up in proposal form and then championed by
  dedicated people willing to see through the process. Usually,
  Unicode has so many proposals to pick from that at each point
  there are more than can be immediately accommodated. There's no
  automatic response to even issues that are "known" to many people.
"Demands" don't mean a thing, formal proposals, presented and
  then refined based on feedback from the committee is what puts
  issues on the track of being resolved.

 That
  leaves us with the question why it did so, downstream your
  statement that it was not what I ended up suspecting.
  
  Does "Y" stand for the peace symbol?

==> No, my thumb sometimes touches the touchpad and flicks the
  cursor while I type. I don't always see where some characters end
  up. Or, I start a sentence and the phone rings. Or any of a number
  of scenarios. Take your pick.

  
 
 
  ISO 31-0 was published in 1992, perhaps too late for Unicode.
  It is normally understood that the thousands separator should
  not have the width of a digit. The alleged reason is security.
  Though on a typewriter, as you state, there is scarcely any
  other option. By that time, all computerized text was fixed
  width, Philippe Verdy reports. On-screen, I figure out, not in
  book print

==> much book printing was also done by photomechanically
  reproducing typescript at that time. Not everybody wanted to
  pay typesetters and digital typesetting wasn't as advanced. I
  actually did use a digital phototypesetter of the period a few
  years before I joined Unicode, so I know. It was more powerful
  than a typewriter, but not as powerful as TeX or later the
  Adobe products.
For one, you didn't typeset a page, only a column of text,
  and it required manual paste-up etc.

  
  Did you also see typewriters with proportional advance width (and
  interchangeable type wheels)? That was the high end on the
  typewriter market. (Already mentioned these typewriters in a
  previous e‑mail.) Books typeset this way could use bold and (less
  easy) italic spans.
Yes, I definitely used an IBM Selectric for many years with
  interchangeable type wheels, but I don't remember using
  proportional spacing, although I've seen it in the kinds of
  "typescript" books I mentioned. Some had that crude approximation
  of typesetting.
When Unicode came out, that was no longer the state of the art as
  TeX and laser printers weren't limited that way.
However, the character sets from which Unicode was assembled (or
  which it had to match, effectively) were designed earlier - during
  those times. And we inherited some things (that needed to be
  supported so round-trip mapping of data was possible) but that
  weren't as well documented in their particulars.
I'm sure we'll eventually deprecate some and clean up others,
  like the Mongolian encoding (which also included some stuff that
  was encoded with an understanding that turned out less solid in
  retrospect than we had thought at the time).
Something the UTC tries very hard to avoid, but nobody is
  perfect. It's best therefore to try not to ascribe non-technical
  motives to any action or inaction of the UTC. What outsiders see
  is rarely what actually went down, and the real reasons for things
  tend to be much less interesting from an interpersonal  or
  intercultural perspective. So best avoid that kind of topic
  altogether and never use it as basis for unfounded recriminations.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 2:46 PM, Shawn Steele via
  Unicode wrote:


  >> That
should not impact all other users out there interested in a
civilized layout.
  I’m not sure
that the choice of the word “civilized” adds value to the
conversation.  We have pretty much zero feedback that the OS’s
French formatting is “uncivilized” or that the NNBSP is required
for correct support.  
  >> As long
as SegoeUI has NNBSP support, no worries, that’s what CLDR data
is for.
  For
compatibility, I’d actually much prefer that CLDR have an alt
“best practice” field that maintained the existing U+00A0
behavior for compatibility, yet allowed applications wanting the
newer typographic experience to opt-in to the “best practice”
alternative data.  As applications became used to the idea of an
alternative for U+00A0, then maybe that could be flip-flopped
and put U+00A0 into a “legacy” alt form in a few years.
  Normally I’m all
for having the “best” data in CLDR, and there are many locales
that have data with limited support for whatever reasons. 
U+00A0 is pretty exceptional in my view though, developers have
been hard-coding dependencies on that value for ½ a century
without even realizing there might be other types of
non-breaking spaces.  Sure, that’s not really the best practice,
particularly in modern computing, but I suspect you’ll still
find it taught in CS classes with little regard to things like
NNBSP.

Shawn, 
  
having information on "common fallbacks"
would be useful. If formatting numbers, I may be free to pick
the "best", but when parsing for numbers I may want to know what
deviations from "best" practice I can expect.
A./
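
A minimal sketch of that kind of tolerance (the accepted separator set and
the choice of U+202F on output are illustrative assumptions, not CLDR data):

    import re

    # Space-like group separators that show up in real data: SPACE,
    # NO-BREAK SPACE, THIN SPACE, NARROW NO-BREAK SPACE.
    SEPARATORS = "\u0020\u00A0\u2009\u202F"

    def parse_grouped(text: str) -> int:
        """Accept any of the common space-like separators when reading a number."""
        return int(re.sub(f"[{SEPARATORS}]", "", text))

    def format_grouped(value: int, sep: str = "\u202F") -> str:
        """Emit a narrow no-break space between digit groups on output."""
        return f"{value:,}".replace(",", sep)

    print(parse_grouped("12\u00A0345\u00A0678"))  # legacy NBSP input -> 12345678
    print(format_grouped(12345678))               # '12 345 678' with U+202F
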


  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 2:05 PM, Marcel Schneider
  via Unicode wrote:


  
  On 18/01/2019 20:09, Asmus Freytag
via Unicode wrote:
  
  

Marcel,
about your many detailed *technical* questions about the
  history of character properties, I am afraid I have no
  specific recollection.
  
  Other List Members are welcome to join in, many of whom are aware
  of how things happened. My questions are meant to be rather
  simple. Summing up the premium ones:
  
Why does UTC ignore the need of a non-breakable thin space?
Why did UTC not declare PUNCTUATION SPACE non-breakable?
  
  A less important information would be how extensively
typewriters with proportional advance width were used to write
books ready for print.
  
  Another question you do answer below:
  
  
French is not the only language that uses a space to group
  figures. In fact, I grew up with thousands separators being
  spaces, but in much of the existing publications or documents
  there was certainly a full (ordinary) space being used. Not
  surprisingly, because in those years documents were
  typewritten and even many books were simply reproduced from
  typescript.
When it comes to figures, there are two different types of
  spaces.
One is a space that has the same width as a digit and is used in
  the layout of lists. For example, if you have a leading
  currency symbol, you may want to have that lined up on the
  left and leave the digits representing the amounts "ragged".
  You would fill the intervening spaces with this "lining" space
  character and everything lines up.
  
  That is exactly how I understood hot-metal typesetting of tables.
  What surprises me is why computerized layout still works the same
  way instead of using tabulations and appropriate tab stops (left,
  right, centered, decimal [with all decimal separators lining up
  vertically]).

==> At the time Unicode was first created (and definitely
  before that, during the time of non-universal character sets) many
  applications existed that used a "typewriter model" and worked by
  space fill rather than decimal-point tabulation.

From today's perspective that older model is inflexible and not
  the best approach, but it is impossible to say how long this
  legacy approach hung on in some places and how much data might
  exist that relied on certain long-standing behaviors of these
  space characters.
For a good solution, you always need to understand
(1) the requirement of your "index" case (French, in this case)
(2) how it relates to similar requirements in (all!) other
  languages / scripts
(3) how it relates to actual legacy practice 

(3a) what will suddenly no longer work if you change the
  properties on some character
(3b) what older data will no longer work if the effective
  behavior of newer applications changes


  
In lists like that, you can get away with not using a narrow
  thousands separator, because the overall context of the list
  indicates which digits belong together and form a number.
  Having a narrow space may still look nicer, but complicates
  the space fill between the symbol and the digits.
  
  It does not, provided that all numbers have thousands separators,
  even if filling with spaces. It looks nicer because it’s more
  legible.
  
Now for numbers in running text using an ordinary space has
  multiple drawbacks. It's definitely less readable and, in
  digital representation, if you use 0020 you don't communicate
  that this is part of a single number that's best not broken
  across lines.
  
  Right.
  
The problem Unicode had is that it did not properly
  understand which of the two types of "numeric" spaces was
  represented by "figure space". (I remember that we had
  discussions on that during the early years, but that they were
  not really resolved and that we moved on to other issues, of
  which many were demanding attention).
  
  You were discussing whether the thousands separator should have
  the width of a digit or the width of a period? Consistently with
  many other choices, the solution would have been to encode them
  both as non-breakable, the more as both were at hand, leaving the
  choice to the end-user.

==> Right, but remember, we started off encoding a set of
  spaces that existed before Unicode (in some other character sets)
  and implicitly made the assumption that th

Re: Encoding italic (was: A last missing link)

2019-01-18 Thread Asmus Freytag via Unicode

  
  
I would fully agree, and I think Mark puts it really well in the
  message below why some of the proposals brandished here are no
  longer plain text but "not-so-plain" text.
I think we are better served with a solution that provides some
  form of "light" rich text, for basic emphasis in short messages.
  The proper way for this would be some form of MarkDown standard
  shared across vendors, and perhaps implemented in a way that users
  don't necessarily need to type anything special, but that, if
  exported to "true" plain text, it turns into the source format for
  the "light" rich text.
This is an effort that's out of scope for Unicode to implement,
  or, I should say, if the Consortium were to take it on, it would
  be a separate technical standard from The Unicode Standard.

A./
PS: I really hate the creeping expansion of pseudo-encoding via
  VS characters. The only worse thing is adding novel control
  functions.



On 1/18/2019 7:51 AM, Mark E. Shoulson
  via Unicode wrote:

On 1/16/19
  6:23 AM, Victor Gaultney via Unicode wrote:
  
  

Encoding 'begin italic' and 'end italic' would introduce
difficulties when partial strings are moved, etc. But that's no
different than with current punctuation. If you select the
second half of a string that includes an end quote character you
end up with a mismatched pair, with the same problems of
interpretation as selecting the second half of a string
including an 'end italic' character. Apps have to deal with it,
and do, as in code editors.


  
  It kinda IS different.  If you paste in half a string, you get a
  mismatched or unmatched paren or quote or something.  A typo, but
  a transient one.  It looks bad where it is, but everything else is
  unaffected.  It's no worse than hitting an extra key by mistake.
  If you paste in a "begin italic" and miss the "end italic",
  though, then *all* your text from that point on is affected!  (Or
  maybe "all until a newline" or some other stopgap ending, but
  that's just damage-control, not damage-prevention.)  Suddenly,
  letters and symbols five words/lines/paragraphs/pages away look
  different, and the pagination is all altered (by far more than merely
  a single extra punctuation mark, since italic fonts generally are
  narrower than roman).  It's a disaster.
  
  
  No.  This kind of statefulness really is beyond what Unicode is
  designed to cope with.  Bidi controls are (almost?) the sole
  exception, and even they cause their share of headaches.  Encoding
  separate _text_ italics/bold is IMO also a disastrous idea, but
  I'm not putting out reasons for that now.  The only really
  feasible suggestion I've heard is using a VS in some fashion.
  (Maybe let it affect whole words instead of individual
  characters?  Makes for fewer noisy VSs, but introduces a whole
  other host of limitations (how to italicize part of a word, how to
  italicize non-letters...) and is also just damage-control, though
  stronger.)
  
  
  Apps (and font makers) can also choose how
to deal with presenting strings of text that are marked as
italic. They can choose to present visual symbols to indicate
begin/end, such as /this/. Or they can present it using the
italic variant of the font, if available.


  
  At which point, you have invented markdown.  Instead of making
  Unicode declare it, just push for vendors everywhere to recognize
  /such notation/ as italics (OK, I know, you want dedicated
  characters for it which can't be confused for anything else.)
  
  
  
  - Those who develop plain text apps
(social media in particular) don't have to build in a whole
markup/markdown layer into their apps


  
  With the complexity of writing an social media app, a markup layer
  is really the least of the concerns when it comes to simplifying.
  
  

- Misuse of math chars for pseudo-italic would likely disappear


- The text runs between markers remain intact, so they need no
special treatment in searching, selecting, etc.


- It finally, and conclusively, would end the decades of the
mess in HTML that surrounds  and .


  
  Adding _another_ solution to something will *never* "conclusively
  end" anything.  On a good day, you can hope it will swamp the
  others, but they'll remain at least in legacy.  More likely, it
  will just add one more way to be confused and another side to the
  mess.  (People have pointed out here about the difficulties of
  distinguishing or 

Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
Marcel,
about your many detailed *technical* questions about the history
  of character properties, I am afraid I have no specific
  recollection.
French is not the only language that uses a space to group
  figures. In fact, I grew up with thousands separators being
  spaces, but in much of the existing publications or documents
  there was certainly a full (ordinary) space being used. Not
  surprisingly, because in those years documents were typewritten
  and even many books were simply reproduced from typescript.
When it comes to figures, there are two different types of
  spaces.
One is a space that has the same width as a digit and is used in the
  layout of lists. For example, if you have a leading currency
  symbol, you may want to have that lined up on the left and leave
  the digits representing the amounts "ragged". You would fill the
  intervening spaces with this "lining" space character and
  everything lines up.
In lists like that, you can get away with not using a narrow
  thousands separator, because the overall context of the list
  indicates which digits belong together and form a number. Having a
  narrow space may still look nicer, but complicates the space fill
  between the symbol and the digits.
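
As a small illustration of that "lining" use (a sketch assuming a font whose
U+2007 FIGURE SPACE really is digit-width; the amounts are made up):

    FIGURE_SPACE = "\u2007"  # defined to be as wide as a (tabular) digit

    amounts = ["5", "1234", "987654"]
    width = max(len(a) for a in amounts)

    # Leading currency symbol on the left, digits right-aligned by filling
    # the gap with figure spaces rather than ordinary spaces.
    for amount in amounts:
        print("$" + FIGURE_SPACE * (width - len(amount)) + amount)
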
Now for numbers in running text using an ordinary space has
  multiple drawbacks. It's definitely less readable and, in digital
  representation, if you use 0020 you don't communicate that this is
  part of a single number that's best not broken across lines.
The problem Unicode had is that it did not properly understand
  which of the two types of "numeric" spaces was represented by
  "figure space". (I remember that we had discussions on that during
  the early years, but that they were not really resolved and that
  we moved on to other issues, of which many were demanding
  attention).
If you want to do the right thing you need:
(1) have a solution that works as intended for ALL language using
  some form of blank as a thousands separator - solving only the
  French issue is not enough. We should not do this a language at a
  time. Do you have colleagues in Germany and other countries that
  can confirm whether their practice matches the French usage in all
  details, or whether there are differences? (Including different
  acceptability of fallback renderings...).
(2) have a solution that works for lining figures as well as
  separators.
(3) have a solution that understands ALL uses of spaces that are
  narrower than normal space. Once a character exists in Unicode,
  people will use it on the basis of "closest fit" to make it do
  (approximately) what they want. Your proposal needs to address any
  issues that would be caused by reinterpreting a character more
  narrowly that it has been used. Only by comprehensively
  identifying ALL uses of comparable spaces in various languages and
  scripts, you can hope to develop a solution that doesn't simply
  break all non-French text in favor of supporting French
  typography.
Perhaps you see why this issue has languished for so long:
  getting it right is not a simple matter.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 7:27 AM, Marcel Schneider
  via Unicode wrote:


  

  
Covering existing
character sets (National, International and Industry)
was an (not "the") important goal at
the time: such coverage was understood as a necessary
(although not sufficient) condition that would enable
data migration to Unicode as well as enable
Unicode-based systems to process and display non-Unicode
data (by conversion). 
  
  

  
  I’d take this as a touchstone to infer that
there were actual data files including standard typographic
spaces as encoded in U+2000..U+2006, and electronic table layout
using these: “U+2007 figure space has a fixed width,
  known as tabular width, which is the same width as digits used in
  tables. U+2008 punctuation space is a space defined to be the same
  width as a period.” 
  Is that correct?
May I remind you that the beginnings of Unicode predate the
  development of the world wide web. By 1993 the web had developed
  to where it was possible to easily access material written in
  different scripts and language, and by today it is certainly
  possible to "sample" material to check for character usage. 

When Unicode was first developed, it was best to work from the
  definition of character sets and to assume that anything encoded
  in a given set was also used somewhere. Several corporations had
  assembled supersets of character sets that their products were
  supporting. The most extensive was a collection from IBM. (I'm
  blanking out on the name for this).
These collections, which often covered international standard
  character sets as well, were some of the prime inputs into the
  early drafts of Unicode. With the merger with ISO 10646 some
  characters from that effort, but not in the early Unicode drafts,
  were also added.
The code points from U+2000..U+2008 are part of that early
  collection.
Note, that prior to Unicode, no character set standard described
  in detail how characters were to be used (with exception, perhaps
  of control functions). Mostly, it was assumed that users knew what
  these characters were and the function of the character set was
  just to give a passive enumeration.
Unicode's character property model changed all that - but that
  meant that properties for all of the characters had to be
  determined long after they were first encoded in the original
  sources, and with only scant hints of the identity of what these
  were intended to be. (Often, the only hint was a character name
  and a rather poor bitmapped image).
If you want to know the "legacy" behavior for these characters,
  it is more useful, therefore, to see how they have been supported
  in existing software, and how they have been used in documents
  since then. That gives you a baseline for understanding whether
  any change or clarification of the properties of one of these code
  points will break "existing practice".
Breaking existing practice should be a dealbreaker, no matter how
  well-intentioned a change is. The only exception is where existing
  implementations are de-facto useless, because of glaring
  inconsistencies or other issues. In such exceptional cases,
  deprecating some interpretations of a character may be a net win.
However, if there's a consensus interpretation of a given
  character the you can't just go in and change it, even if it would
  make that character work "better" for a given circumstance: you
  simply don't know (unless you research widely) how people have
  used that character in documents that work for them. Breaking
  those documents retroactively, is not acceptable.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 7:27 AM, Marcel Schneider
  via Unicode wrote:

I understand only better
  why a significant majority of UTC is hating French.
  
Francophobia is also palpable in Canada, beyond any
technical reasons, especially in the IT industry. Hence the
position of UTC is far from isolated. If ethic and personal
considerations inflect decision-making, they should consistently
be an integral part of discussions here. In that vein, I’d
mention that by the time when Unicode was developed, there was a
global hatred against France, that originated in French colonial
and foreign politics since WWII, and was revived a few years ago
by the French government sinking 푅푎푖푛푏표푤 푊푎푟푟푖표푟
and killing the crew’s photographer, in the port of Auckland.
That crime triggered a peak of anger.
Again, my recollections do not support
any issues of Francophobia.
The Unicode Technical committee has always
had French people on board, from the beginning, and I have
witnessed no issues where they took up a different technical
position based on language. Quite the opposite, the UTC
generally appreciates when someone can provide native insights
into the requirements for supporting a given language. How best
to realize these requirements then becomes a joint effort.
  
If anything, the Unicode Consortium saw itself from the beginning
  in contrast to an IT culture for which internationalization at
  times was still something of an afterthought.
Given all that, I find your suggestions and  implications deeply
  hurtful and hope you will find a way to avoid a repetition in the
  future.
May I suggest that trying to rake over the past and apportion
  blame is generally less productive than moving forward and
  addressing the outstanding problems.
A./





  



Re: NNBSP

2019-01-17 Thread Asmus Freytag via Unicode

  
  
On 1/17/2019 9:35 AM, Marcel Schneider
  via Unicode wrote:


  
[quoted mail]
  


But the French "espace fine insécable" was requested
  long long before Mongolian was discussed for encoding in
  the UCS. The problem is that the initial rush for French
  was made in a period where Unicode and ISO were competing
  and not in sync, so no agreement could be found, until
  there was a decision to merge the efforts. The early rush
  was in ISO still not using any character model but a glyph
  model, with little desire to support multiple whitespaces;
  on the Unicode side, there was initially no desire to
  encode all the languages and scripts, focusing initially
  only on trying to unify the existing vendor character sets
  which were already implemented by a limited set of
  proprietary vendor implementations (notably IBM,
  Microsoft, HP, Digital) plus a few of the registered
  charsets in IANA including the existing ISO 8859-*, GBK,
  and some national standard or de facto standards (Russia,
  Thailand, Japan, Korea).
This early rush did not involve typographers (well
  there was Adobe at this time but still using another
  unrelated technology). Font standards were still not
  existing and were competing in incompatible ways, all was
  a mess at that time, so publishers were still required to
  use proprietary software solutions, with very low
  interoperability (at that time the only "standard" was
  PostScript, not needing any character encoding at all, but
  only encoding glyphs!)
  

  
  
  Thank you for this insight. It is a still untold part of the
  history of Unicode.
This historical summary does not square
in key points with my own recollection (I was there). I would
therefore not rely on it as if gospel truth.
  
In particular, one of the key technologies
that brought industry partners to cooperate around Unicode
was font technology, in particular the development of the TrueType
Standard. I find it not credible that no typographers were
part of that project :).
Covering existing character sets (National,
International and Industry) was an (not "the") important
goal at the time: such coverage was understood as a necessary
(although not sufficient) condition that would enable data
migration to Unicode as well as enable Unicode-based systems to
process and display non-Unicode data (by conversion). 
  
The statement: "there was initially no
desire to encode all the languages and scripts" is categorically
false.
(Incidentally, Unicode does not "encode
languages" - no character encoding does).
What has some resemblance of truth is that
the understanding of how best to encode whitespace evolved over
time. For a long time, there was a confusion whether spaces of
different width were simply digital representations of various
metal blanks used in hot metal typography to lay out text. As
the placement of these was largely handled by the typesetter,
not the author, it was felt that they would be better modeled by
variable spacing applied mechanically during layout, such as
applying indents or justification.
  
Gradually it became better understood that
there was a second use for these: there are situations where
some elements of running text have a gap of a specific width
between them, such as a figure space, which is better treated
like a character under authors or numeric formatting control
than something that gets automatically inserted during layout
and rendering.
Other spaces were found best modeled with a
minimal width, subject to expansion during layout if needed.

  
There is a wide range of typographical
quality in printed publication. The late '70s and '80s saw many
books published by direct photomechanical reproduction of
typescripts. These represent perhaps the bottom end of the
quality scale: they did not implement many fine typographical
details and their prevalence among technical literature may have
impeded the understanding of what character encoding support
would be needed for true fine typography. At the same time,
Donald Knuth was refining TeX to restore high quality digital
typography, initially for mathematics.
However, TeX did not have an underlying
character encoding; it was using a completely different model
 

Re: Encoding italic (was: A last missing link)

2019-01-16 Thread Asmus Freytag via Unicode

  
  
On 1/16/2019 7:38 PM, James Kass via
  Unicode wrote:

Computer
  text tradition aside, nobody seems to offer any legitimate reason
  why such information isn't worthy of being preservable in
  plain-text.  Perhaps there isn't one.
  

By introducing state, even localized, you
are creating a de-facto "rich-text" protocol - unless you
duplicate all code points in italics, what you create is 'not so
plain' text.
VS's and similar efforts at pseudo-encoding
are already in a gray zone, but at least for VSs there's an
established protocol for how to ignore presentation issues in
processing.
Much of the discussion of 'plain text' here
is very focused on presentation and does not adequately consider
the harm done to the text-processing model that underlies
Unicode.
That's all I'm prepared to contribute for a
bit.
A./

  



Re: New ideas (from: wws dot org)

2019-01-16 Thread Asmus Freytag (c) via Unicode

On 1/16/2019 9:30 AM, wjgo_10...@btinternet.com wrote:

Asmus Freytag wrote as follows:

 PS: of course, if a contemplated change, such as the one alluded to, 
should be ill advised, its negative effects could have wide ranging 
impacts...but that's not the topic here.


If you object to encoding italics please say so and if possible please 
provide some reasons.


It's not the topic of this thread. Let's keep the discussion in one place.

A./








Re: wws dot org

2019-01-16 Thread Asmus Freytag via Unicode

  
  
On 1/16/2019 6:33 AM, Marcel Schneider
  via Unicode wrote:

So to
  date, Unicode has only made half its way, and for every single
  script in the 
  Standard there is another script out there that remains still
  unsupported.
  
  First things first. When I first replied in the first thread of
  this year I already 
  warned:
  >>> Having said that, still unsupported minority
  languages are top priority. 
  
  I didn’t guess that I opened a Pandora box whose content would
  lead us 
  far away from the only useful goal deeply embedded in the concept
  of 
  Unicode: support all of the world’s writing systems.
You will find that the existing Unicode
support for 28+ modern scripts is
sufficient to cover the languages that are in everyday written
use and likely
to remain so, because they are formally taught to the next
generation.
There are a handful of additional scripts,
also already encoded, that would
cover languages for which such written use is emerging or
re-emerging.
  
The rest of Unicode scripts plus scripts yet
to be encoded are about 
preservation and capture of existing written records, or
transcription of
languages in predominantly spoken use, not primarily about
supporting active everyday communication of language users.
As with all complex scenarios, there may be
this or that edge case that
the generalization above doesn't adequately cover. However, the
fact 
remains that encoding of additional scripts affects a different
realm of
usage.
Extensions contemplated that impact everyday
communication (as
those discussed on a parallel thread here) are therefore
potentially
useful on a very practical level to users of majority and
minority languages
living today. Therefore, the implication that only the
additional coverage
of dead scripts, or transcription of endangered languages is a
useful goal rings a bit false.

It's a very worthwhile goal, but so is making improvements to
those aspects
of Unicode that widely figure in everyday communication of
living populations.
A./
PS: of course, if a contemplated change,
such as the one alluded to, should be ill 
advised, its negative effects could have wide ranging
impacts...but that's
not the topic here.
  

  
  



Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 5:41 PM, Mark E. Shoulson
  via Unicode wrote:


  
  On 1/14/19 5:08 AM, Tex via Unicode
wrote:
  
  




  This thread has gone on for a bit and
I question if there is any more light that can be shed.
   
  BTW, I admit to liking Asmus
definition for functions that span text being a definition
or criteria for rich text.
  

  
  Me too.  There are probably some exceptions or weird
corner-cases, but it seems to be a really good encapsulation of
the distinction which I had never seen before.

** blush **
A./


  



Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 2:08 PM, Tex via Unicode
  wrote:


  
  
  
  
Asmus,
 
I
agree 100%. Asking where is the harm was an actual question
intended to surface problems. It wasn’t rhetoric for saying
there is no harm.
  

The harm comes when this is imported into rich text environments
  (like this e-mail inbox). Here, the math abuse and the styled text
  run may look the same, but I cannot search for things based on
  what I see. I see an English or French word, type it in the search
  box and it won't be found. I call that 'stealth' text.
The answer is not necessarily in folding the two, because one of
  the reasons for having math alphabetics is so you can search for a
  variable "a" of a certain kind without getting hits on every "a" in
  the text. Destroying that functionality in an attempt to "solve"
  the problems created by the alternate facsimile of styled text is
  also "harm" in some way.
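
One way search tooling papers over such "stealth" text is compatibility
normalization, which folds the math alphanumerics back to ordinary letters,
at the cost of exactly the distinction described above (a minimal sketch):

    import unicodedata

    math_a = "\U0001D44E"  # MATHEMATICAL ITALIC SMALL A, the 'variable' case
    styled = "\U0001D45D\U0001D44E\U0001D460\U0001D460\U0001D452\u0301"  # 푝푎푠푠푒́

    # NFKC maps the math alphanumerics back to plain letters, so plain-text
    # search finds the word again -- and every math 푎 now collides with
    # every ordinary "a".
    print(unicodedata.normalize("NFKC", math_a))   # 'a'
    print(unicodedata.normalize("NFKC", styled))   # 'passé'
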


  

 
Also,
it may not be obvious to social media, messaging platforms,
that there is a possibility of a solution. Often when a
problem exists for a long time, it fades into
unconsciousness. The pain is accepted as that is the way it
is and has to be.
  

A push for (more) universal support of lowest common denominator
  "markdown" would go a long way to support such features in
  environments where SGML-style markup is infeasible and out-of-band
  communication not possible.
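
A toy sketch of what such a lowest-common-denominator convention could look
like on the implementation side (the *asterisk* syntax here is only an
assumed example, not an agreed standard):

    import re

    ITALIC_SPAN = re.compile(r"\*([^*]+)\*")  # assumed convention: *italic span*

    def plain_to_html(text: str) -> str:
        """Render the agreed emphasis convention as a rich-text (HTML) span."""
        return ITALIC_SPAN.sub(r"<i>\1</i>", text)

    def html_to_plain(text: str) -> str:
        """Export the italic span back to its plain-text source form."""
        return text.replace("<i>", "*").replace("</i>", "*")

    print(plain_to_html("a *moot* point"))        # a <i>moot</i> point
    print(html_to_plain("a <i>moot</i> point"))   # a *moot* point
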


  

It
becomes part of the culture. Asking if there is a pain and
whether a solution would be welcomed is consciousness
raising.
 
I
agree about leading standardization. I thought some
legitimate needs were raised. The questions were designed to
quantify the use case as well as the potential damage.
  

Also, treating everything as a character encoding problem is so
  broken.

  

 
I
didn’t think anyone was recommending more math abuse. I
thought it was raised as an example of people resorting to
them as a solution for a need. Of course they are also an
example of playful experimentation.
 
Separately,
Regarding
messaging platforms, although twitter is one example in the
social media space, today there are many business,
commercial, and other applications that embed messaging
capabilities for their communities and for servicing
customers.
I
wouldn’t dismiss the need just based on twitter’s assessment
or on the idea that social media is just for casual or “fun”
use. Clarity of communications can be significant for many
organizations. Having the proposed capabilities in plain
text rather than requiring all of the overhead of a more
rich text solution could be a big win for these apps.
  

I see the math abuse as something that is being done as an
  exercise of playfulness. There are other uses of characters based
  on what they look like, rather than what they mean (or are
  intended for) and much applies to those cases as well.
However, that's independent from making a value judgement on
  social media as such just because some people use the features
  more creatively. That's a judgement that I have neither made nor
  would I be comfortable with it.
A./


  

 
tex
 
 

  
From:
Unicode [mailto:unicode-boun...@unicode.org] On
  Behalf Of Asmus Freytag via Unicode
Sent: Monday, January 14, 2019 1:21 PM
To: unicode@unicode.org
Subject: Re: A last missing link for
interoperable representation
  

 

  On 1/14/2019 2:08 AM, Tex via Unicode
wrote:


  Perhaps the question should be put to
twitter, messaging apps, text-to-voice vendors, and others
whether it will be useful or not.
  If the discussion continues I would
like to see more of a cost/benefit analysis. Where is the
harm? What will the benefit to user communities be?

The
"it does no harm" is never an argument "for" making a
change. It's something of a necessary, but not a sufficient
condition, in other words.
More
to the point, if there were platforms (like social 

Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 2:43 PM, James Kass via
  Unicode wrote:


  
  Hans Åberg wrote,
  
  
  > How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́
  
  
  Thought about using a combining accent.  Figured it would just
  display with a dotted circle but neglected to try it out first. 
  It actually renders perfectly here.  /That's/ good to know. 
  (smile)
  
  
  

While all of this displays fine, it
currently can't be found in the same search that would locate
true italics.
As I am seeing this in an environment that
otherwise supports rich text, the result is "stealth" text.
Stuff that I can read, but not process, without being able to
see a difference.
  
A./

  



Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 3:37 PM, Richard
  Wordingham via Unicode wrote:


  On Tue, 15 Jan 2019 00:02:49 +0100
Hans Åberg via Unicode  wrote:


  

  On 14 Jan 2019, at 23:43, James Kass via Unicode
 wrote:

Hans Åberg wrote,
  

  
How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́  

  
  
Thought about using a combining accent.  Figured it would just
display with a dotted circle but neglected to try it out first.  It
actually renders perfectly here.  /That's/ good to know.  (smile)  



It is a bit off here. One can try math, too: the derivative of 훾(푡)
is 훾̇(푡).

  
  
No it isn't.  You should be using a spacing character for
differentiation. 

Sorry, but there may be different conventions. The dot /
  double-dot above is definitely common usage in physics.

A./




   On the other hand, one uses a combining circumflex
for Fourier transforms.

Richard. 







  



Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 2:58 PM, David Starner via
  Unicode wrote:


  Source code is an example of plain text, and yet adding italics into
comments would require but a trivial change to editors. If the user
audience cared, it would have been done. In fact, I suspect there
exist editors and environments where an HTML subset is put into
comments and rendered by the editors; certainly active links would be
more useful in source code comments than italics.

Source Insight is a nice and powerful
programming editor that supports rich-text display of source
code, i.e. beyond simple syntax coloring / linkification.
For example, large type for function names.
  
They even support some styling in comments,
but more along the lines of allowing their own markdown
convention that lets you write headings of different levels.
Both to write comments that introduce
sections of your code, as well as headings and subheadings
inside longer comment blocks.
So stuff like that exists, but it's using
semantic markup (style settings per language element) or
markdown (styles in comments).
A./

  



Re: A last missing link for interoperable representation

2019-01-14 Thread Asmus Freytag via Unicode

  
  
On 1/14/2019 2:08 AM, Tex via Unicode
  wrote:


  Perhaps the question should be put to
twitter, messaging apps, text-to-voice vendors, and others
whether it will be useful or not.
  If the discussion continues I would like
to see more of a cost/benefit analysis. Where is the harm? What
will the benefit to user communities be?

The "it does no harm" is never an argument
"for" making a change. It's something of a necessary, but not a
sufficient condition, in other words.
More to the point, if there were platforms
(like social media) that felt an urgent need to support styling
without a markup language, and could articulate that need in
terms of a proposal, then we would have something to discuss.
(We might engage them in a discussion of the advisability of
supporting "markdown", for example).
  
Short of that, I'm extremely leery of
"leading" standardization; that is, encoding things that "might"
be used.
As for the abuse of math alphabetics. That's
happening whether we like it or not, but at this point
represents playful experimentation by the exuberant fringe of
Unicode users and certainly doesn't need any additional
extensions. 

  



Re: A last missing link for interoperable representation

2019-01-12 Thread Asmus Freytag via Unicode

  
  
On 1/12/2019 5:22 AM, Richard
  Wordingham via Unicode wrote:


  On Sat, 12 Jan 2019 10:57:26 + (GMT)
Julian Bradfield via Unicode  wrote:


  
It's also fundamentally misguided. When I _italicize_ a word, I am
writing a word composed of (plain old) letters, and then styling the
word; I am not composing a new and different word ("_italicize_") that
is distinct from the old word ("italicize") by virtue of being made up
of different letters.

  
  
And what happens when you capitalise a word for emphasis or to begin a
sentence?  Is it no longer the same word?



Typographically, the act of using italics or different font
  weight is more akin to using a different font than to using
  different letters. Not only did old metal types require the
  creation of a different font (albeit with a design coordinated
  with the regular type) but even in the digital world, purpose
  designed italic etc. typefaces beat attempts at parametrizing
  regular fonts. (Although some of the intelligence that goes into
  creating those designs can nowadays be approximated by
  automation).
What this teaches you is that italicizing (or boldfacing) text is
  fundamentally related to picking out parts of your text in a
  different font. It's an operation on a span of text, not something
  that results in different letters (or letter attributes).
Deep in the age of metal type this would have been no surprise to
  users. As I had occasion to mention before, some languages had the
  (rather universally observed) typographical convention of setting
  foreign terms apart by using a different font (Antiqua vs.
  Fraktur for ordinary text). At the same time, other languages used
  italics for the same purpose (which technically also meant using a
  different typeface).
To go further, the use of typography to mark emphasis also
  followed conventions that focused on spans of letters not on the
  individual letters. For example, in Fraktur, you would never have
  been able to emphasize a single letter, as emphasis was conveyed
  by increased inter-letter spacing. (That restriction was not as
  limiting as it appears in languages that do not have single-letter
  words).
Anyway, this points to a way to make the distinction between
  plain text and rich text a more principled one (and explains why
  math alphabets seemingly form an exception).
The domain of rich text are all typographic and stylistic
  elements that establish spans of text, whether that is
  underlining, emphasis, letter spacing, font weight, type face
  selection or whatever. Plain text deals with letters in a way that
  is as stateless as possible, that is, does not set up spans. Math
  alphabetics are an exception by virtue of the fact that they are
  individual letters that have a particular identity different from
  the "same" letter in text or the "same" letter that's part of a
  different math alphabet.
So those screen readers got it right, except that they could have
  used one of the more typical notational conventions that the
  math alphabetics are used to express (e.g. "vector" etc.), rather
  than rattling off the Unicode name.
To reiterate, if you effectively require a span (even if you
  could simulate that differently) you are in the realm or rich
  text. The one big exception to that is bidi, because it is utterly
  impossible to do bidi text without text ranges. Therefore, Unicode
  plain text explicitly violates that principle in favor of
  achieving a fundamental goal of universality, that is being able
  to include the bidi languages.
None of the other uses contemplated here rise to the same level
  of violating a fundamental goal in the same way.
A./

  



Re: A last missing link for interoperable representation

2019-01-09 Thread Asmus Freytag via Unicode

  
  
On 1/9/2019 4:41 PM, Mark E. Shoulson
  via Unicode wrote:


  
  On 1/9/19 2:30 AM, Asmus Freytag via
Unicode wrote:
  
  

English use of italics on isolated words
to disambiguate the reading of some sentences is a
convention. Everybody who does it, does it the same way. Not
supported in plain text.
German books from the Fraktur age used
Antiqua for Latin and other foreign terms. Definitely a
convention that was rather universally applied (in books at
least). Not supported in plain text.
  
  Aren't there printing conventions that
  indicate this type of "contrastive stress" using letterspacing
  instead of font style?  I'm s u r e I've seen it in German and
  other Latin-written languages, and also even occasionally in
  Hebrew, whose experiments with italics tend not to be
  encouraging.

That's a related issue. Fraktur doesn't have an italic style, so
  emphasis is generally done by letterspacing -- and that
  letterspacing better respect the mandatory ligatures in Fraktur
  (they are neither spaced, nor replaced by non-ligated letters).
Because typesetting Fraktur follows a number of conventions
  not found when the same text is typeset in
  Roman fonts, there's simply no way that you can shift between
  these by something like a simple style sheet, and definitely not
  by taking plain text and globally selecting a Fraktur font (see
  foreign terms issue above).
Theoretically, you should be able to do so, because Latin script
  use across all typographic traditions is unified in the encoding,
  but in practice, you'll run into limitations and only some
  final-form rich text format (like PDF) will guarantee that stuff
  appears correct and as intended.
This is perhaps interesting because many books of the period
  exist in editions in either typographic style. Trying to get the
  two different renditions from the same "backbone" would require,
  at a minimum, very careful semantic mark-up (e.g. identifying
  foreign words) and a non-trivial stylesheet (assuming that you can
  even get the correct letterspacing done by your rendering engine).

A./



  



Re: A last missing link for interoperable representation

2019-01-09 Thread Asmus Freytag via Unicode

  
  
On 1/9/2019 1:37 AM, Tex via Unicode
  wrote:


  
  
  
  
 
   James Kass wrote:

  If a text is published in all italics, that’s style/font
  choice.  If a text is published using italics and roman
  contrastively and consistently, and everybody else is doing it
  pretty much the same way, that’s a convention. 
 
   Asmus Freytag responded:
But not all conventions are deemed worthy of plaintext encoding.
What are the criteria for “worthy”?
  

See answer to James's post.

  

Way back when, when plain text was very very plain, arguments about not including text styling seemed reasonable. But with the inclusion of numerous emoji as James mentioned, it seems odd to be protesting a few characters that would enhance “plain text” considerably. Plain text editors today support bold, italic, and other styles as a fundamental requirement for usability. More text editors support styling than support bidi or interlinear annotation.
If there were support for the handful of text features used by most plain text editors (bold, italic, strikethrough, underline, superscript, subscript, et al) (perhaps using more generalized names such as emphasis, stress, deleted…), then many of the redundant (bold, italic, …) characters in Unicode would not have been needed. HTML seemed to do very well with a very few styling elements. HTML is of course rich text, but I am just demonstrating that a very small number of control characters would bring plain text into the modern state of text editing. Editors that don’t have the capability for bolding, underlining, etc. could ignore these controls or convert them to another convention.
As James requested, it would also provide interoperability.
Arguments about all of the conventions that Unicode does not support don’t seem compelling to me, as it seems increasingly random as to what is accepted and what isn’t, or at least the rationales seem inconsistent.
A case in point is the addition of the “SS” character, which made implementation complex with little benefit.
Interlinear annotation is perhaps another example.
I don’t want to enter into a debate about why these deserved inclusion. I am only saying they seem less useful than some other cases which seem deserving.
**And right now, Dr. Strangelove style, my right hand is restraining my other hand from typing on the keyboard, to avoid saying anything about emoji.**
Ken distinguished numerous variations of stress, which of course have their place, representations and uses. But perhaps for plain text we only need a way to indicate “stress here” and leave it to the text editor to have some form of rendering. For more distinctions the user needs to use rich text. Surely there is an 80/20 rule that motivates a solution rather than letting the one percent prevent a capability that 99% would enjoy.
(Yes, I mixed metaphors. I feel an Occupy Unicode movement coming on. ☺)
I don’t see how adding a few text style controls would be a burden to most implementers. Given ideographic variation sequences, skin tones, hair styles, and the other requirements for proper Unicode support, arguing against a few text styling capabilities seems very last century. (Or at least 1990s…) And it might save having to add a few more bold, italic, superscript, et al compatibility characters…
tex
 
 
 
 
  



  



Re: A last missing link for interoperable representation

2019-01-09 Thread Asmus Freytag via Unicode

  
  
On 1/9/2019 1:06 AM, James Kass via
  Unicode wrote:


  
  Asmus Freytag wrote,
  
  
  > Still, not supported in plain text (unless you abuse the
  
  > math alphabets for things they were not intended for).
  
  
  The unintended usage of math alphanumerics in the real world is
  fairly widespread, at least in screen names.
  
  
  (I still get a kick out of this:)
  
  http://www.ewellic.org/mathtext.html
  
  
  I wonder how many times Doug's program has been downloaded.
  
  
  Whether it's "abuse" or not might depend on whether one considers
  the user community of the machines which process the texts to be
  more important than the user community of human beings who author,
  exchange, and read the texts.
  

   It's "abuse" because all these extensions for symbols only
  ever cover the ASCII range. Couldn't do actual German Fraktur text
  with the math alphabets (other than selected words). Same for
  italics.
A good test might be whether something would require duplicating
  the entire Unicode range to achieve full coverage (or at least
  significant subsets like multiple, entire scripts).
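
To make the ASCII-only coverage concrete, here is a minimal sketch (Python; the function name is mine, and it deliberately uses the contiguous Mathematical Bold range at U+1D400/U+1D41A rather than Fraktur or Italic, which have gaps) of the kind of mapping such tools perform. Anything outside A–Z and a–z simply passes through, which is exactly why full German text cannot be faked this way:

```python
# Sketch only: map ASCII letters onto the Mathematical Bold alphabet.
# U+1D400 = MATHEMATICAL BOLD CAPITAL A, U+1D41A = MATHEMATICAL BOLD SMALL A;
# this range happens to be contiguous, unlike the Fraktur and Italic ranges,
# whose missing capitals live in the Letterlike Symbols block.
def to_math_bold(text: str) -> str:
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # ä, ö, ü, ß etc.: no math-alphabet code points exist
    return "".join(out)

print(to_math_bold("Strasse"))  # fully converted
print(to_math_bold("Straße"))   # the ß survives unconverted
```
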


  
  Real humans are the user community of the UCS.  It's up to the
  user community to determine how its letters and symbols get used. 
  That's the general rule-of-thumb Unicode applies to the subset
  user communities, and it should apply to the complete superset as
  well.
  
  
  

There's a cost to providing multiple ways of achieving the same
  effect and it cuts against the "uniqueness" in encoding that
  Unicode set out to achieve ("unique", "universal" and "uniform"
  were the three mantras that launched the standard).
A./



  



Re: A last missing link for interoperable representation

2019-01-08 Thread Asmus Freytag via Unicode

  
  
On 1/8/2019 10:58 PM, James Kass via
  Unicode wrote:


  If a text is published in all italics, that’s style/font choice. 
  If a text is published using italics and roman contrastively and
  consistently, and everybody else is doing it pretty much the same
  way, that’s a convention.

But not all conventions are deemed worthy of
plaintext encoding.
English use of italics on isolated words to
disambiguate the reading of some sentences is a convention.
Everybody who does it, does it the same way. Not supported in
plain text.
German books from the Fraktur age used
Antiqua for Latin and other foreign terms. Definitely a
convention that was rather universally applied (in books at
least). Not supported in plain text.
In the first example, the mere need for
disambiguation tells you that contrastive use should be
possible: while some cases might not be truly ambiguous but
merely misleading to the reader, the ambiguity implies that more than
one reading may be possible and thus the use of
italics would be contrastive.
In the second example, some foreign words
use the same spelling as German words; the convention makes
clear which is intended, and dropping it where the author relies
on it, might well introduce ambiguity. Most of that convention
wouldn't be contrastive, but in some cases it easily would be.
In either case, you lose information that's
related to content and not merely to dressing up the text in a
"pretty" way.
Still, not supported in plain text (unless
you abuse the math alphabets for things they were not intended
for).
Like so many general statements relating to
Unicode, even this carries its exceptions.
A./

  



Re: A last missing link for interoperable representation

2019-01-08 Thread Asmus Freytag via Unicode

  
  
On 1/8/2019 1:11 PM, James Kass via
  Unicode wrote:


  
  Asmus Freytag wrote,
  
  
  > ...
  
  > (for an extreme example there's an orthography
  
  > out there that uses @ as a letter -- we know that
  
  > won't work well with email addresses and duplicate
  
  > encoding of the @ shape is a complete non-starter).
  
  
  Everything's a non-starter.  Until it begins.
  

It's a non-starter because of the security-sensitive nature of @.

  
  Is this a casing orthography?  (Please see attached image.)
  
  
  We've seen where typewriter kludges enabled users to represent the
  glottal stop with a question mark (or a digit seven).  Unicode
  makes those kludges unnecessary.
  
  
  But we're still using typewriter kludges to represent stress in
  Latin script because there is no Unicode plain text solution.
  
  



  



Re: A last missing link for interoperable representation

2019-01-07 Thread Asmus Freytag via Unicode

  
  
On 1/7/2019 10:40 PM, Marcel Schneider
  via Unicode wrote:

The
  pitch is that if some languages are still considered “needing”
  rich text where others are correctly represented in plain text
  (stress, abbreviations), the Standard needs to be updated in a way
  that it fully supports actually all languages.
There will always be some texts (in the most
general sense of this term!) that will require certain features
found only in rich text, and there are some unusual
orthographies don't play well in the context of certain
technologies (for an extreme example there's an orthography out
there that uses @ as a letter -- we know that won't work well
with email addresses and duplicate encoding of the @ shape is a
complete non-starter).
A./

  



Re: A last missing link for interoperable representation

2019-01-07 Thread Asmus Freytag via Unicode

  
  
On 1/7/2019 7:46 PM, James Kass via
  Unicode wrote:

Making
  recommendations for the post processing of strings containing the
  combining low line strikes me as being outside the scope of
  Unicode, though.
Agreed. 
  
Those kinds of things are effectively "mark
down" languages, a name chosen to define them as lighter weight
alternatives to formal, especially SGML derived mark-up
languages.
Neither mark-up nor mark down languages are
in scope.
  
A./

  
  



Re: Compatibility Casefold Equivalence

2018-11-24 Thread Asmus Freytag via Unicode

  
  
On 11/22/2018 11:58 AM, Carl via
  Unicode wrote:


  (It looks like my HTML email got scrubbed, sorry for the double post)

Hi,


In Chapter 3 Section 13, the Unicode spec defines D146:


"A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)"


I am trying to understand the "if and only if" part of this.   Specifically, why is the outermost NFKD necessary?  Could it also be a NFKC normalization?   Is wrapping the outer NFKD in a NFC or NFKC on both sides of the equation okay?


My use case is that I am trying to store user-provided tags in a database.  I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146.  However, because decomposition can result in much larger strings, I would prefer to keep  the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).



Carl,
you may find that some of the complications are limited to a
  small number of code points. In particular, classical (polytonic)
  Greek has some gnarly behavior wrt case; and some compatibility
  characters have odd edge cases.

I'm personally not a fan of allowing every single Unicode code
  point in things like usernames (or other types of identifiers).
  Especially, if including some code points makes the "general case"
  that much more complex, my personal recommendation would be to
  simply disallow / reject a small set of troublesome characters;
  especially if they aren't part of some widespread modern
  orthography. 

While Unicode is about being able to digitally represent all
  written text, identifiers don't follow the same rules. The main
  reason why people often allow "anything" is because it's easy in
  terms of specification. Sometimes, you may not have control over
  what to accept; for example if tags are generated from headers in
  a document, it would require some transform to handle disallowed
  code points.
Case is also only one of the types of duplication you may
  encounter. In many South and South East Asian scripts you may
  encounter cases where two sequences of characters, while
  different, will normally render identically. Arabic also has
  instances of that. Finally, you may ask yourself whether your
  system should treat simplified and traditional Chinese ideographs
  as separate or as a variant not unlike the way you treat case.
About storing your tag data: you can obviously store the tags as
  NFC if you like; in that case, you will have to run the operations
  both on the stored tag and on the new tag.
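
As a concrete illustration (a minimal sketch only, with function names of my own choosing; it assumes Python's unicodedata module and str.casefold(), which stands in for toCasefold()), the D146 fold can be computed once per tag and stored alongside the NFC form of the tag itself:

```python
import unicodedata

def d146_key(s: str) -> str:
    """Sketch of NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), per D146.
    str.casefold() is used here as an approximation of toCasefold()."""
    s = unicodedata.normalize("NFD", s)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)

def store_tag(db: dict, tag: str) -> None:
    # Deduplicate on the D146 key; keep the first-seen tag in compact NFC form.
    db.setdefault(d146_key(tag), unicodedata.normalize("NFC", tag))
```

With the key stored, a newly submitted tag needs only one fold-and-normalize pass before lookup.
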
Finally, there are some cases where you can tell that two strings
  are identical without actually carrying out the full set of
  operations:
Y = X
NFC(Y) = NFC(X)
and so on. (If any of these conditions holds, the full condition
  above must also hold.) For example, let's apply 

NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
on both sides of

NFC(Y) = NFC(X)
First:

NFD(NFC(Y)) = NFD(NFC(X))
Because the two sides are equal, applying toCasefold results in
  equal strings, and so on all the way to the outer NFKD.
In other words, you can stop the comparison at any point where
  the two sides are equal. From that point on, the outer operations
  cannot add anything.
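
In code, that short-circuit might look like this (again only a sketch, reusing the import and the hypothetical d146_key() from the sketch above):

```python
def compat_caseless_match(x: str, y: str) -> bool:
    if x == y:                          # code-point identical: nothing more to check
        return True
    if unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y):
        return True                     # canonically equivalent: outer operations add nothing
    return d146_key(x) == d146_key(y)   # fall back to the full D146 comparison
```
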

A./

  


