Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

On 1/19/2019 3:53 AM, James Kass via Unicode wrote:
Marcel Schneider wrote,

> When you ask to know the foundations and that knowledge is persistently refused,
> you end up believing that those foundations just can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere,
> where it actually belongs.

Why not think of it as a learning curve?  Early concepts and priorities were made from a lower position on that curve.  We can learn from the past and apply those lessons to the future, but a post-mortem seldom benefits the cadaver.



+1. Well put about the cadaver.


  
Minutiae about decisions made long ago probably exist, but may be presently poorly indexed/organized and difficult to search/access.  As the collection of encoding history becomes more sophisticated and the searching technology becomes more civilized, it may become easier to glean information from the archives.

(OT - A little humor, perhaps...

On the topic of Francophobia, it is true that some of us do not like dead generalissimos.  But most of us adore the French for reasons beyond Brigitte Bardot and bon-bons.  Cuisine, fries, dip, toast, curls, culture, kissing, and tarts, for instance.  Not to mention cognac and champagne!)
  
  
  

It is time for this discussion to be moved to a small group of people interested in hashing out actual proposals for submission. Is there anyone here who would like to collaborate with Marcel to find a solution for European number formatting that

(1) fully supports the typographic best practice
(2) identifies acceptable fallbacks
(3) is compatible with existing legacy practice, even if that does not conform to (1) or (2)
(4) includes necessary adjustments to CLDR

If nobody here is interested in working on that, discussing this further on this list will not serve a useful purpose, as nothing will change in Unicode without a well-formulated proposal that covers the four parameters laid out here.

A./
  
  



Re: NNBSP

2019-01-19 Thread James Kass via Unicode



Marcel Schneider wrote,

> When you ask to know the foundations and that knowledge is persistently refused,
> you end up believing that those foundations just can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere,
> where it actually belongs.

Why not think of it as a learning curve?  Early concepts and priorities 
were made from a lower position on that curve.  We can learn from the 
past and apply those lessons to the future, but a post-mortem seldom 
benefits the cadaver.


Minutiae about decisions made long ago probably exist, but may be 
presently poorly indexed/organized and difficult to search/access. As 
the collection of encoding history becomes more sophisticated and the 
searching technology becomes more civilized, it may become easier to 
glean information from the archives.


(OT - A little humor, perhaps...
On the topic of Francophobia, it is true that some of us do not like 
dead generalissimos.  But most of us adore the French for reasons beyond 
Brigitte Bardot and bon-bons.  Cuisine, fries, dip, toast, curls, 
culture, kissing, and tarts, for instance.  Not to mention cognac and 
champagne!)




Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 09:42, Asmus Freytag via Unicode wrote:

[…]

For one, many worthwhile additions / changes to Unicode depend on getting written up in 
proposal form and then championed by dedicated people willing to see through the process. 
Usually, Unicode has so many proposals to pick from that at each point there are more 
than can be immediately accommodated. There's no automatic response to even issues that 
are "known" to many people.

"Demands" don't mean a thing, formal proposals, presented and then refined 
based on feedback from the committee is what puts issues on the track of being resolved.


That is also what I suspected, that the French were not eager enough to get 
French supported, as opposed to the Vietnamese who lobbied long before the era 
of proposals and UTC meetings.

Please, where can we find the proposals for FIGURE SPACE to become non-breakable, and for PUNCTUATION SPACE to stay or become breakable?

(That is not a rhetorical question. The ideal answer is a URL.
Also, that is not about pre-Unicode documentation, but about the action that 
Unicode took in that era.)


[…]

Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, 
but I don't remember using proportional spacing, although I've seen it in the kinds of 
"typescript" books I mentioned. Some had that crude approximation of 
typesetting.


Thanks for reporting.


When Unicode came out, that was no longer the state of the art as TeX and laser 
printers weren't limited that way.

However, the character sets from which Unicode was assembled (or which it had 
to match, effectively) were designed earlier - during those times. And we 
inherited some things (that needed to be supported so round-trip mapping of 
data was possible) but that weren't as well documented in their particulars.

I'm sure we'll eventually deprecate some and clean up others, like the 
Mongolian encoding (which also included some stuff that was encoded with an 
understanding that turned out less solid in retrospect than we had thought at 
the time).

Something the UTC tries very hard to avoid, but nobody is perfect. It's best 
therefore to try not to ascribe non-technical motives to any action or inaction 
of the UTC. What outsiders see is rarely what actually went down,


That is because the meeting minutes would benefit from being more explicit.


and the real reasons for things tend to be much less interesting from an 
interpersonal  or intercultural perspective.


I don’t care about “interesting” reasons. I’d just appreciate knowing the truth.


So best avoid that kind of topic altogether and never use it as basis for 
unfounded recriminations.


When you ask to know the foundations and that knowledge is persistently refused, you end up believing that those foundations just can’t be told.

Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it actually belongs. I’d kindly request not to be considered a hypocrite who in reality keeps blaming the UTC.


A./





Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 01:21, Shawn Steele wrote:


>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a little narrow-minded.

>> I didn’t look into these data interchanges but I suspect they won’t use any thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/


Thanks for sharing. As it happens, what I like most is the first reason you provide:

 * “The most obvious reason is that there is a bug in the data and we had to 
make a change. (Believe it or not we make mistakes ;-))  In this case our users 
(and yours too) want culturally correct data, so we have to fix the bug even if 
it breaks existing applications.”


No comment :)


>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.


No problem.


>> What are all these expected to do while localized with scripts outside 
Windows code pages?

(We call those “unicode-only” locales FWIW)


Noted.


The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).



Like here:
https://blogs.msdn.microsoft.com/shawnste/2009/06/01/writing-fields-of-data-to-an-encoded-file/

You’re showcasing that despite “The moral here is ‘Use Unicode’ ” some people are still not using it. It gets even weirder when you state that code pages and Unicode are not 1:1, contradicting the Unicode design principle of round-trip compatibility.

The point of not using Unicode, and likewise of not using verbose formats, is limited hardware resources. Often new implementations are built on top of old machines and programs, for example in the energy and shipping industries. This poses a security threat, ending up in power outages and logistic breakdowns. That makes our democracies vulnerable. Hence maintaining obsolete systems does not pay off. We’re all better off recycling all the old hardware and investing in the latest technologies, implementing Unicode along the way.

What you are advocating in this thread seems like a non-starter.


However, the fact that an app may run very poorly in Cherokee or whatever 
doesn’t mean that there aren’t a bunch of French enterprises that depend on 
that app for their day-to-day business.


They’re ill-advised in doing so (see above).


In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.


To be “selected”, not developed and built. The job is already done. What are 
people waiting for?


However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.


They may not see it because they’re lacking appropriate training in cyber security. You seem to be backing that unresponsive behavior. I can’t see that you may be doing any good by doing so, and I’d strongly advise you to reach out to your customers, or raise the issue with your managers. We’re in a time where companies are still making huge profits, and it is unclear where all that money goes once paid out to shareholders. The money is there; you only need to market the security. That job would be a better use of your time than tampering with legacy apps.


Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.


Am I pestering CLDR…

Keeping CLDR in sync is just the right way to go.

Since we’re on it: Do you have any hints about why some powerful UTC members 
seem to hate NNBSP in French?
I’m mainly talking about French punctuation spacing here.


Which is why I suggested an “opt-in” alt form that apps wanting “civilized” behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).



Asmus Freytag’s proposal seems better:

“having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.”


Because if you let your customers “opt in” instead of urging them to update, 
some will never opt in, given they

Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

On 1/18/2019 11:34 PM, Marcel Schneider via Unicode wrote:

Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?

==> probably not in the early days. Y

  
Perhaps it was ignored from the beginning, as Philippe Verdy reports that UTC ignored later demands, getting users upset.

==> for reasons given in another post, I tend to not give much credit to these suggestions.

For one, many worthwhile additions / changes to Unicode depend on getting written up in proposal form and then championed by dedicated people willing to see through the process. Usually, Unicode has so many proposals to pick from that at each point there are more than can be immediately accommodated. There's no automatic response to even issues that are "known" to many people.

"Demands" don't mean a thing; formal proposals, presented and then refined based on feedback from the committee, are what put issues on the track of being resolved.

That leaves us with the question of why it did so, given your statement that it was not what I ended up suspecting.

Does "Y" stand for the peace symbol?

==> No, my thumb sometimes touches the touchpad and flicks the cursor while I type. I don't always see where some characters end up. Or, I start a sentence and the phone rings. Or any of a number of scenarios. Take your pick.

  
 
 
ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure, not in book print.

==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.

Did you also see typewriters with proportional advance width (and interchangeable type wheels)? That was the high end of the typewriter market. (I already mentioned these typewriters in a previous e‑mail.) Books typeset this way could use bold and (less easily) italic spans.

Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, but I don't remember using proportional spacing, although I've seen it in the kinds of "typescript" books I mentioned. Some had that crude approximation of typesetting.
When Unicode came out, that was no longer the state of the art, as TeX and laser printers weren't limited that way.

However, the character sets from which Unicode was assembled (or which it had to match, effectively) were designed earlier - during those times. And we inherited some things (that needed to be supported so round-trip mapping of data was possible) but that weren't as well documented in their particulars.

I'm sure we'll eventually deprecate some and clean up others, like the Mongolian encoding (which also included some stuff that was encoded with an understanding that turned out less solid in retrospect than we had thought at the time).

Something the UTC tries very hard to avoid, but nobody is perfect. It's best therefore to try not to ascribe non-technical motives to any action or inaction of the UTC. What outsiders see is rarely what actually went down, and the real reasons for things tend to be much less interesting from an interpersonal or intercultural perspective. So best avoid that kind of topic altogether and never use it as a basis for unfounded recriminations.

A./

  



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:

On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:

On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:


Marcel,

about your many detailed *technical* questions about the history of character 
properties, I am afraid I have no specific recollection.


Other List Members are welcome to join in, many of whom are aware of how things 
happened. My questions are meant to be rather simple. Summing up the premium 
ones:

 1. Why does UTC ignore the need for a non-breakable thin space?
 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?

A less important piece of information would be how extensively typewriters with proportional advance width were used to produce books ready for print.

Another question you do answer below:


French is not the only language that uses a space to group figures. In fact, I 
grew up with thousands separators being spaces, but in much of the existing 
publications or documents there was certainly a full (ordinary) space being 
used. Not surprisingly, because in those years documents were typewritten and 
even many books were simply reproduced from typescript.

When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.


That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout works the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically]).


==> At the time Unicode was first created (and definitely before that, during the time of 
non-universal character sets) many applications existed that used a "typewriter 
model" and worked by space fill rather than decimal-point tabulation.


If you are talking about applications, as opposed to typesetting tables for book printing, then I’d suggest that the fixed-width display of tables could be done much like today’s source code layout, where normal space is used for that purpose. In this use case, line wrap is typically turned off. That could make non-breakable spaces sort of pointless (but I’m aware of your point below), except if people are expected to re-use the data in other environments. In that case, best practice is to use NNBSP as thousands separator while displaying it like other monospace characters. That’s at least how today’s monospace fonts work (provided they’re used in environments actually supporting Unicode, which may not happen with applications running in a terminal).


From today's perspective that older model is inflexible and not the best 
approach, but it is impossible to say how long this legacy approach hung on in 
some places and how much data might exist that relied on certain long-standing 
behaviors of these space characters.


My position for some time has been that legacy apps should use legacy libraries. But I’ll come back to this when responding to Shawn Steele.


For a good solution, you always need to understand

(1) the requirement of your "index" case (French, in this case)


That’s okay.


(2) how it relates to similar requirements in (all!) other languages / scripts


That’s rather up to CLDR as I suggested, given it has the means to submit a 
point to all vetters. See again below (in the part that you’ve cut off without 
consideration).


(3) how it relates to actual legacy practice


That’s Shawn Steele’s point (see next reply).


(3a) what will suddenly no longer work if you change the properties on some 
character

(3b) what older data will no longer work if the effective behavior of newer 
applications changes


I’ll note right away that this requires awareness of actual use cases and/or delving into the OSes, which is far beyond what I can currently do, both in terms of time and of resources. The vetter’s role is to inform CLDR with correct data from their locale. CLDR is then welcome to sort things out and to get in touch with the industry, which CLDR TC is actually doing. But that has no impact on the data submitted at survey time. Changing votes to say “OK, let the group separator be NBSP as long as…” would be a lie.



In lists like that, you can get away with not using a narrow thousands 
separator, because the overall context of the list indicates which digits 
belong together and form a number. Having a narrow space may still look nicer, 
but complicates the space fill between the symbol and the digits.


It does not, provided that all numbers have thousands separators, even if 
filling with spaces. It looks nicer because it’s more legible.


Now for numbers in running text using an o

Re: NNBSP

2019-01-18 Thread Richard Wordingham via Unicode
On Fri, 18 Jan 2019 10:20:22 -0800
Asmus Freytag via Unicode  wrote:

> However, if there's a consensus interpretation of a given character
> then you can't just go in and change it, even if it would make that
> character work "better" for a given circumstance: you simply don't
> know (unless you research widely) how people have used that character
> in documents that work for them. Breaking those documents
> retroactively is not acceptable.

Unless the UCD contains a contrary definition only usable where the
character wouldn't normally be used, in which case it is fine to try
to kick the character's users in the teeth. I am referring to the
belief that ZWSP separated words, whereas the UCD only defined it as a
lay-out control.  That outlawed belief has recently been very helpful
to me in using (as opposed to testing) a nod-Lana spell-checker on
Firefox.

Richard.


Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 2:46 PM, Shawn Steele via Unicode wrote:

>> That should not impact all other users out there interested in a civilized layout.

I’m not sure that the choice of the word “civilized” adds value to the conversation.  We have pretty much zero feedback that the OS’s French formatting is “uncivilized” or that the NNBSP is required for correct support.

>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is for.

For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.

Normally I’m all for having the “best” data in CLDR, and there are many locales that have data with limited support for whatever reasons.  U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces.  Sure, that’s not really the best practice, particularly in modern computing, but I suspect you’ll still find it taught in CS classes with little regard to things like NNBSP.

Shawn,

having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.

A./
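
A minimal sketch of such deviation-tolerant parsing (an illustration only, not ICU's API; it accepts the space-like group separators discussed in this thread):

    # Hypothetical lenient number parsing: strip every plausible group
    # separator before converting. Standard Python only.
    GROUP_SEPARATORS = {
        "\u0020",  # SPACE
        "\u00A0",  # NO-BREAK SPACE
        "\u2007",  # FIGURE SPACE
        "\u2008",  # PUNCTUATION SPACE
        "\u2009",  # THIN SPACE
        "\u202F",  # NARROW NO-BREAK SPACE
    }

    def parse_grouped_int(text: str) -> int:
        return int("".join(ch for ch in text if ch not in GROUP_SEPARATORS))

    assert parse_grouped_int("1\u202F234\u202F567") == 1234567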


  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode
ose were the correct set (just like we took punctuation from ASCII and similar sources and only added to it later, when we understood that they were missing things --- generally always added, generally did not redefine behavior or shape of existing code points).

Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?

==> probably not in the early days. Y

ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure, not in book print.

==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.


  
If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time.

That is how CLDR works.

CLDR data is by definition per-language. Except for inheritance, languages are independent.

There are no "French" characters. When you encode characters, at best, some code points may be script-specific. For punctuation and spaces not even that may be the case. Therefore, as long as you try to solve this as if it only was a French problem, you are not doing proper character encoding.




  
Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differing acceptability of fallback renderings...)

No I don’t, but people may wish to read German Wikipedia:

https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen

Shared in ticket #11423:
https://unicode.org/cldr/trac/ticket/11423#comment:15

==> for your proposal to be effective, you need to reach out.

 
  
(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts can you hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.

There is no such problem, except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why, to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used for outside Mongolian: a non-breakable thin space, kind of a belated avatar of what PUNCTUATION SPACE should have been since the beginning.

==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, it will not change something that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it.)

If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible.


  
Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.

RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed 
>> for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a little narrow-minded.

>> I didn’t look into these data interchanges but I suspect they won’t use any
>> thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside 
>> Windows code pages?

(We call those “unicode-only” locales FWIW)

The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).  However, the fact that an app 
may run very poorly in Cherokee or whatever doesn’t mean that there aren’t a 
bunch of French enterprises that depend on that app for their day-to-day 
business.

In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.

However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.

Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.

Which is why I suggested an “opt-in” alt form that apps wanting “civilized” behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).

The data for locales like French tends to have been very stable for decades.  
Changes to data for major locales like that are more disruptive than to newer 
emerging markets where the data is undergoing more churn.

-Shawn



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 23:46, Shawn Steele wrote:


>> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.

I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).


If they are obsolete apps, they don’t use CLDR / ICU, as these are designed for 
up-to-date and fully localized apps. So one hassle is off the table.


Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.


I didn’t look into these data interchanges but I suspect they won’t use any thousands separator at all to interchange data. The group separator is only for display and print, and there you may wish to use a compat library for obsolete apps, and the newest library for apps with Unicode support. If an app is that obsolete, it will keep working without new data from ICU.


This could be something like a “new” app running with the latest & greatest locale data trying to import the legacy data users had saved in that app.  Or exchanging data with an application using the system settings, which are perhaps older.


Again, I don’t believe that apps are storing numbers with thousands separators in them. Not even spreadsheet software does that. I say “not even” because these are high-end apps where the latest locale data is expected.

Sorry, you did skip this one:

>> What are all these expected to do while localized with scripts outside Windows code pages?

Indeed, that is the paradox: Tirhuta users are entitled to correct display with the newest data, while Latin users are bothered indefinitely with old data and legacy display.


>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.


Not the user. I’m addressing your concerns as coming from the developer side. I 
meant you should use the data as appropriate, and if a character is beyond 
support, just replace it for convenience.


>> That should not impact all other users out there interested in a civilized 
layout.

I’m not sure that the choice of the word “civilized” adds value to the 
conversation.


That is to express in a mouthful of English what user feedback is or can be, even if not all the time. Users complain about quotation marks spaced off too far when typeset with NBSP, as Word does. It’s really ugly, they say. NBSP is a character with a precise usage; it’s not one-size-fits-all. BTW, as you are in the job: why does Word not provide an option with a checkbox letting the user set the space as desired, NBSP or NNBSP?


  We have pretty much zero feedback that the OS’s French formatting is 
“uncivilized” or that the NNBSP is required for correct support.


That is, at some point users stop submitting feedback when they see how little use it is to spend time posting it. From the “pretty much zero” you may wish to pick the one or two you do get, guessing that for each one there are a thousand other users out there with the same feedback who don’t submit it. One thousand or one million, it’s hard to be precise…


>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
for.

For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.


You don’t need that field in CLDR. Here’s how it works: take the locale data, search-and-replace all NNBSP with NBSP, and here’s the library you’ll use, because NNBSP is not only in the group separator. I’d suggest downloading common/main/fr.xml and checking all instances of NNBSP. The legacy apps you’re referring to don’t use that data for sure. That data is for fine high-end apps and for the user interfaces of Windows and any other OS. If you want your employer to be well served, you’d rather prefer the correct data, not legacy fallbacks.
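
A minimal sketch of such a downgrade tool (an assumption of what “run a tool on the data” could look like, not an existing CLDR utility):

    # Replace every NARROW NO-BREAK SPACE (U+202F) with NO-BREAK SPACE
    # (U+00A0) in a CLDR locale file such as common/main/fr.xml.
    from pathlib import Path

    def downgrade_cldr_file(path: str) -> None:
        p = Path(path)
        text = p.read_text(encoding="utf-8")
        p.write_text(text.replace("\u202f", "\u00a0"), encoding="utf-8")

    # downgrade_cldr_file("common/main/fr.xml")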


Normally I’m all for having the “best” data in CLDR, and there are many locales 
that have data with limited support for whatever reasons.  U+00A0 is pretty 
exceptional in my view though, developers have been hard-coding dependencies on 
that value for ½ a century without even realizing there might be other types of 
non-breaking spaces.  Sure, that’s not real

RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> Keeping these applications outdated has no other benefit than providing a 
>> handy lobbying tool against support of NNBSP.
I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).
Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.
This could be something like a “new” app running with the latest & greatest locale data trying to import the legacy data users had saved in that app.  Or exchanging data with an application using the system settings, which are perhaps older.
>> Also when you need those apps, just tailor your French accordingly.
Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.
>> That should not impact all other users out there interested in a civilized 
>> layout.
I’m not sure that the choice of the word “civilized” adds value to the 
conversation.  We have pretty much zero feedback that the OS’s French 
formatting is “uncivilized” or that the NNBSP is required for correct support.
>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
>> for.
For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.
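
A sketch of that opt-in model (names and structure are mine for illustration, not CLDR's actual schema):

    # French group separator: legacy U+00A0 stays the default; apps that
    # want the newer typographic data must explicitly opt in.
    SEPARATORS_FR = {"legacy": "\u00a0", "best_practice": "\u202f"}

    def group_separator_fr(opt_in_best_practice: bool = False) -> str:
        return SEPARATORS_FR["best_practice" if opt_in_best_practice else "legacy"]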
Normally I’m all for having the “best” data in CLDR, and there are many locales that have data with limited support for whatever reasons.  U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces.  Sure, that’s not really the best practice, particularly in modern computing, but I suspect you’ll still find it taught in CS classes with little regard to things like NNBSP.
-Shawn



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 22:03, Shawn Steele via Unicode wrote:


I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out that part of the reason NBSP has been used for thousands separators is that it exists in all of those legacy code pages that were mentioned, predating Unicode.

Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy code pages.  NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).

13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:48    58?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on them), and many legacy applications that don’t have proper UTF-8 or Unicode support (I know, they should be updated).  It doesn’t seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.


Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP. What are all these expected to do while localized with scripts outside Windows code pages?

Also, when you need those apps, just tailor your French accordingly. That should not impact all other users out there interested in a civilized layout, which we cannot get with NBSP, as it is justifying and numbers are torn apart in justified layout, nor with FIGURE SPACE as recommended in UAX #14, because it’s too wide and has no other benefit. BTW, FIGURE SPACE yields the same question mark in the Windows terminal, I guess, based on the above.

As long as SegoeUI has NNBSP support, no worries; that’s what CLDR data is for. Any legacy program can always use downgraded data; you can even replace NBSP if the expected output is plain ASCII. Downgrading is straightforward; the reverse is not true, which is why vetters are working so hard during CLDR surveys. CLDR data is kind of high-end; that is the only useful goal. Again, downgrading is easy: just run a tool on the data and the job is done. You’ll end up with two libraries instead of one, but at least you’re able to provide a good UX in environments supporting any UTF.

Best,

Marcel



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out that part of the reason NBSP has been used for thousands separators is that it exists in all of those legacy code pages that were mentioned, predating Unicode.

Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy code pages.  NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).
13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:48    58?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on them), and many legacy applications that don’t have proper UTF-8 or Unicode support (I know, they should be updated).  It doesn’t seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.
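
The question marks above can be reproduced in one line (a sketch; cp850 stands in for whichever OEM code page the console happens to use):

    # U+202F NARROW NO-BREAK SPACE has no mapping in legacy code pages such
    # as cp850 or cp1252, so lossy conversion substitutes "?".
    print("15\u202f360".encode("cp850", errors="replace").decode("cp850"))  # 15?360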

-Shawn

 
http://blogs.msdn.com/shawnste



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)

These collections, which often covered international standard character sets as 
well, were some of the prime inputs into the early drafts of Unicode. With the 
merger with ISO 10646 some characters from that effort, but not in the early 
Unicode drafts, were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note that, prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.

Unicode's character property model changed all that - but that meant that 
properties for all of the characters had to be determined long after they were 
first encoded in the original sources, and with only scant hints of the 
identity of what these were intended to be. (Often, the only hint was a 
character name and a rather poor bitmapped image).

If you want to know the "legacy" behavior for these characters, it is more useful, 
therefore, to see how they have been supported in existing software, and how they have been used in 
documents since then. That gives you a baseline for understanding whether any change or 
clarification of the properties of one of these code points will break "existing 
practice".

Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de-facto useless, because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.

However, if there's a consensus interpretation of a given character then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.


That is however what PRI #308 proposed to do: change the Gc of NNBSP from Zs to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the *MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break, for example, those implementations relying on Gc=Zs for the purpose of applying a background color to all (otherwise invisible) space characters.
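
An implementation of the kind meant here might look as follows (a sketch; unicodedata ships with Python, and U+202F currently reports General_Category Zs):

    import unicodedata

    def highlight_spaces(text: str) -> str:
        # Mark every character whose General_Category is Zs (space separator).
        return "".join(
            f"[{ord(ch):04X}]" if unicodedata.category(ch) == "Zs" else ch
            for ch in text
        )

    print(highlight_spaces("1\u202f234"))  # 1[202F]234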

On the occasion of that Public Review Issue, J. S. Choi reported another use case of NNBSP, between an integer and a vulgar fraction, pointing out an error in TUS version 8.0 by the way: “the THIN SPACE does not prevent line breaking from occurring, which is required in style guides such as the Chicago Manual of Style”. ― In version 11.0 the erroneous part is still uncorrected: “If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.”  Note that TUS has typeset this with the precomposed U+00BE, not with plain digits and fraction slash.
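
Spelled out as a code point sequence (a sketch; rendering the result as 1¾ depends on font support for fraction formation):

    # 1 + U+2009 THIN SPACE + 3 + U+2044 FRACTION SLASH + 4.
    # Choi's point: U+2009 allows a line break here, while U+202F would not.
    mixed_number = "1\u20093\u20444"
    print(mixed_number)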

If U+2008 PUNCTUATION SPACE is used as intended, changing its line break property from A to GL does not break any implementation or document. As for possible misuse of the character in ways other than intended, generally there is no point in using as br

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:02, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

I understand only better why a significant majority of UTC is hating French.

Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethical and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I’d mention that by the time when Unicode was developed, there was a global hatred against France that originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking the Rainbow Warrior and killing the crew’s photographer in the port of Auckland. That crime triggered a peak of anger.


Again, my recollections do *not support* any issues of _Francophobia_.

The Unicode Technical Committee has always had French people on board, from the beginning, and I have witnessed no issues where they took up a different technical position based on language. Quite the opposite: the UTC generally appreciates when someone can provide native insights into the requirements for supporting a given language. How best to realize these requirements then becomes a joint effort.

If anything, the Unicode Consortium saw itself from the beginning in contrast 
to an IT culture for which internationalization at times was still something of 
an afterthought.

Given all that, I find your suggestions and implications deeply hurtful and hope you will find a way to avoid a repetition in the future.

May I suggest that trying to rake over the past and apportion blame is generally less productive than _moving forward_ and addressing the outstanding problems.


It is my last-resort track, one that I’m deeply convinced of. But I’m thankfully relieved of needing to discuss it here further.

To point out a well-founded behavior is not to blame. You’ll note that I carefully established how UTC would have been right in doing so, if they did. I wasn’t aware that I was being hurtful. You tell me, so I apologize. Please note, though, based on my past e‑mail, that I see UTC as a compound of multiple, sometimes antagonistic tendencies. Just an example to help understand what I mean: when Karl Pentzlin proposed to encode a missing French abbreviation indicator, a typographer was directed to argue (on behalf of his employer, IIUC) that this would be a case of encoding all scripts in bold and italic. The OP protested that it wasn’t, but he was not heard. That example raises much concern, the more so as we were told on this List that decision makers in UTC are refusing to join in open and public discussions here, and are only “duelling ballot comments.”

Now, since regardless of being right in doing so, they did not at all, I’m plunged again into disarray. May I quote Germaine Tillion, a French ethnologist: It’s important to understand what happens to us; to understand is to exist. ― Originally, “to exist” meant “to stand out.” That is still somewhat implied in the strong sense of “to exist.” Understanding does also help to overcome. That’s why I wrote one e‑mail before:

Nothing happens, or does not happen, without a good reason.
Finding out that reason is key to recovery.
If we want to get what we need, we must do our homework first.

Thanks for helping bring it to the point.

Kind regards,

Marcel


Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

Marcel,

about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.
French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript.
When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.
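
A sketch of that "lining" use (an illustration; it assumes a font in which U+2007 FIGURE SPACE matches digit width):

    # Pad ragged amounts with U+2007 so the digits line up under a fixed
    # leading currency symbol.
    amounts = ["1", "25", "1300"]
    width = max(len(a) for a in amounts)
    for a in amounts:
        print("$" + "\u2007" * (width - len(a)) + a)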
In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.

Now for numbers in running text, using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines.

The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention.)
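
A sketch of the running-text case (an illustration, not ICU; U+202F keeps the groups from being broken across lines where U+0020 would not):

    # Format an integer with NARROW NO-BREAK SPACE as the group separator.
    def format_grouped(n: int, sep: str = "\u202f") -> str:
        return f"{n:,}".replace(",", sep)

    print(format_grouped(1234567))  # "1 234 567" with U+202F between groups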
If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differing acceptability of fallback renderings...)
(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts can you hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.

Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:


  

  
Covering existing character sets (National, International and Industry) was an (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).

I’d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: “U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.”

Is that correct?
May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)

These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646 some characters from that effort, but not in the early Unicode drafts, were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note that, prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.
Unicode's character property model changed all that - but that
  meant that properties for all of the characters had to be
  determined long after they were first encoded in the original
  sources, and with only scant hints of the identity of what these
  were intended to be. (Often, the only hint was a character name
  and a rather poor bitmapped image).
If you want to know the "legacy" behavior for these characters,
  it is more useful, therefore, to see how they have been supported
  in existing software, and how they have been used in documents
  since then. That gives you a baseline for understanding whether
  any change or clarification of the properties of one of these code
  points will break "existing practice".
Breaking existing practice should be a dealbreaker, no matter how
  well-intentioned a change is. The only exception is where existing
  implementations are de facto useless, because of glaring
  inconsistencies or other issues. In such exceptional cases,
  deprecating some interpretations of a character may be a net win.
However, if there's a consensus interpretation of a given
  character, then you can't just go in and change it, even if it would
  make that character work "better" for a given circumstance: you
  simply don't know (unless you research widely) how people have
  used that character in documents that work for them. Breaking
  those documents retroactively is not acceptable.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

> I understand only better why a significant majority of the UTC hates
> French.
>
> Francophobia is also palpable in Canada, beyond any technical reasons,
> especially in the IT industry. Hence the position of the UTC is far
> from isolated. If ethical and personal considerations inflect
> decision-making, they should consistently be an integral part of
> discussions here. In that vein, I’d mention that by the time Unicode
> was developed, there was a global hatred against France, which
> originated in French colonial and foreign politics since WWII, and was
> revived a few years ago by the French government sinking the _Rainbow
> Warrior_ and killing the crew’s photographer in the port of Auckland.
> That crime triggered a peak of anger.

Again, my recollections do not support any issues of Francophobia.
The Unicode Technical Committee has always
had French people on board, from the beginning, and I have
witnessed no issues where they took up a different technical
position based on language. Quite the opposite: the UTC
generally appreciates when someone can provide native insights
into the requirements for supporting a given language. How best
to realize these requirements then becomes a joint effort.
  
If anything, the Unicode Consortium saw itself from the beginning
  in contrast to an IT culture for which internationalization at
  times was still something of an afterthought.
Given all that, I find your suggestions and implications deeply
  hurtful and hope you will find a way to avoid a repetition in the
  future.
May I suggest that trying to rake over the past and apportion
  blame is generally less productive than moving forward and
  addressing the outstanding problems.
A./





  



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 17/01/2019 20:11, 梁海 Liang Hai via Unicode wrote:

[Just a quick note to everyone that, I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]


Welcome to Unicode Public.

Hopefully this discussion helps sort things out so that we’ll know both what to 
do wrt Mongolian and what to do wrt French.

On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

 [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. Tge early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

Thank you for this insight. It is a still untold part of the history of Unicode.

This historical summary does *not* square in key points with my own 
recollection (I was there). I would therefore not rely on it as gospel truth.

In particular, one of the key technologies that _brought industry partners to 
cooperate around Unicode_ was font technology, in particular the development of 
the /TrueType/ standard. I find it not credible that no typographers were part 
of that project :).


It is probably part of the (unintentional) false blame spread by the cited 
author’s paper. My apologies for not sufficiently assessing the reliability of 
my sources. I’d already identified a number of errors but wasn’t savvy enough 
to see the other one reported by Richard Wordingham. Now the paper ends up 
as a mere libel. It doesn’t mention the lack of NNBSP; instead it piles up a 
bunch of gratuitous calumnies. Should that be the prevailing mood of average 
French professionals with respect to Unicode ― indeed Patrick Andries is the 
only French tech writer on Unicode I found whose work is acclaimed, the others 
are either disliked or silent (or libellers) ― then I understand only better 
why a significant majority of the UTC hates French.

Francophobia is also palpable in Canada, beyond any technical reasons, 
especially in the IT industry. Hence the position of the UTC is far from 
isolated. If ethical and personal considerations inflect decision-making, they 
should consistently be an integral part of discussions here. In that vein, I’d 
mention that by the time Unicode was developed, there was a global hatred against 
France, which originated in French colonial and foreign politics since WWII, and 
was revived a few years ago by the French government sinking the _Rainbow Warrior_ 
and killing the crew’s photographer in the port of Auckland. That crime 
triggered a peak of anger.


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


The statement: "there was initially no desire to encode all the languages and 
scripts" is categorically false.


Though Unicode was designed as limited to 65 000 characters, and it was 
stated that historic scripts were out of scope, only livi

Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 18:35:49 +0100
Marcel Schneider via Unicode  wrote:


> Among the grievances, Unicode is blamed for confusing Greek psili and
> dasia with comma shapes, and for misinterpreting Latin letter forms
> such as the u with descender taken for a turned h, and double u
> mistaken for a turned m, errors that subsequently misled font
> designers to apply misplaced serifs.

And I suppose that the influence was so great that it travelled back in
time to 1976, affecting the typography of the Pelican book 'Phonetics'
as reprinted in 1976.

Those IPA characters originated in a tradition where new characters had
been derived by rotating other characters so as to avoid having to have
new type cut.  Misplaced serifs appear to be original.

Richard.



Re: NNBSP

2019-01-17 Thread 梁海 Liang Hai via Unicode
[Just a quick note to everyone that, I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]

Best,
梁海 Liang Hai
https://lianghai.github.io

> On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode  
> wrote:
> 
> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>>> [quoted mail]
>>> 
>>> But the French "espace fine insécable" was requested long long before 
>>> Mongolian was discussed for encoding in the UCS. The problem is that the 
>>> initial rush for French was made in a period where Unicode and ISO were 
>>> competing and not in sync, so no agreement could be found, until there was 
>>> a decision to merge the efforts. The early rush was in ISO still not using 
>>> any character model but a glyph model, with little desire to support 
>>> multiple whitespaces; on the Unicode side, there was initially no desire to 
>>> encode all the languages and scripts, focusing initially only on trying to 
>>> unify the existing vendor character sets which were already implemented by 
>>> a limited set of proprietary vendor implementations (notably IBM, 
>>> Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
>>> including the existing ISO 8859-*, GBK, and some national standards or de 
>>> facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well, there was Adobe at this 
>>> time but still using another unrelated technology). Font standards did not 
>>> yet exist and were competing in incompatible ways; all was a mess 
>>> at that time, so publishers were still required to use proprietary software 
>>> solutions, with very low interoperability (at that time the only "standard" 
>>> was PostScript, not needing any character encoding at all, but only 
>>> encoding glyphs!)
>> 
>> Thank you for this insight. It is a still untold part of the history of 
>> Unicode.
> This historical summary does not square in key points with my own 
> recollection (I was there). I would therefore not rely on it as if gospel 
> truth.
> 
> In particular, one of the key technologies that brought industry partners to 
> cooperate around Unicode was font technology, in particular the development 
> of the TrueType Standard. I find it not credible that no typographers were 
> part of that project :).
> 
> Covering existing character sets (National, International and Industry) was 
> an (not "the") important goal at the time: such coverage was understood as a 
> necessary (although not sufficient) condition that would enable data 
> migration to Unicode as well as enable Unicode-based systems to process and 
> display non-Unicode data (by conversion). 
> 
> The statement: "there was initially no desire to encode all the languages and 
> scripts" is categorically false.
> 
> (Incidentally, Unicode does not "encode languages" - no character encoding 
> does).
> 
> What has some resemblance of truth is that the understanding of how best to 
> encode whitespace evolved over time. For a long time, there was a confusion 
> whether spaces of different width were simply digital representations of 
> various metal blanks used in hot metal typography to lay out text. As the 
> placement of these was largely handled by the typesetter, not the author, it 
> was felt that they would be better modeled by variable spacing applied 
> mechanically during layout, such as applying indents or justification.
> 
> Gradually it became better understood that there was a second use for these: 
> there are situations where some elements of running text have a gap of a 
> specific width between them, such as a figure space, which is better treated 
> like a character under authors or numeric formatting control than something 
> that gets automatically inserted during layout and rendering.
> 
> Other spaces were found best modeled with a minimal width, subject to 
> expansion during layout if needed.
> 
> 
> 
> There is a wide range of typographical quality in printed publication. The 
> late '70s and '80s saw many books published by direct photomechanical 
> reproduction of typescripts. These represent perhaps the bottom end of the 
> quality scale: they did not implement many fine typographical details and 
> their prevalence among technical literature may have impeded the 
> understanding of what character encoding support would be needed for true 
> fine typography. At the same time, Donald Knuth was refining TeX to restore 
> high quality digital typography, initially for mathematics.
> 
> However, TeX did not have an underlying character encoding; it was using a 
> completely different model mediating between source data and final output. 
> (And it did not know anything about typography for other writing systems).
> 
> Therefore, it is not surprising that it took a while and a few false starts 
> to get the encoding model correct for space characters.
> 
> Hopefully, well 

Re: NNBSP

2019-01-17 Thread Asmus Freytag via Unicode

  
  
On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

[quoted mail]
But the French "espace fine insécable" was requested
  long long before Mongolian was discussed for encodinc in
  the UCS. The problem is that the initial rush for French
  was made in a period where Unicode and ISO were competing
  and not in sync, so no agreement could be found, until
  there was a decision to merge the efforts. Tge early rush
  was in ISO still not using any character model but a glyph
  model, with little desire to support multiple whitespaces;
  on the Unicode side, there was initially no desire to
  encode all the languages and scripts, focusing initially
  only on trying to unify the existing vendor character sets
  which were already implemented by a limited set of
  proprietary vendor implementations (notably IBM,
  Microsoft, HP, Digital) plus a few of the registered
  chrsets in IANA including the existing ISO 8859-*, GBK,
  and some national standard or de facto standards (Russia,
  Thailand, Japan, Korea).
This early rush did not involve typographers (well
  there was Adobe at this time but still using another
  unrelated technology). Font standards were still not
  existing and were competing in incompatible ways, all was
  a mess at that time, so publishers were still required to
  use proprietary software solutions, with very low
  interoperability (at that time the only "standard" was
  PostScript, not needing any character encoding at all, but
  only encoding glyphs!)
  

  
  
  Thank you for this insight. It is a still untold part of the
  history of Unicode.
This historical summary does not square
in key points with my own recollection (I was there). I would
therefore not rely on it as gospel truth.
  
In particular, one of the key technologies
that brought industry partners to cooperate around Unicode
was font technology, in particular the development of the TrueType
Standard. I find it not credible that no typographers were
part of that project :).
Covering existing character sets (National,
International and Industry) was an (not "the") important
goal at the time: such coverage was understood as a necessary
(although not sufficient) condition that would enable data
migration to Unicode as well as enable Unicode-based systems to
process and display non-Unicode data (by conversion). 
  
The statement: "there was initially no
desire to encode all the languages and scripts" is categorically
false.
(Incidentally, Unicode does not "encode
languages" - no character encoding does).
What has some resemblance of truth is that
the understanding of how best to encode whitespace evolved over
time. For a long time, there was a confusion whether spaces of
different width were simply digital representations of various
metal blanks used in hot metal typography to lay out text. As
the placement of these was largely handled by the typesetter,
not the author, it was felt that they would be better modeled by
variable spacing applied mechanically during layout, such as
applying indents or justification.
  
Gradually it became better understood that
there was a second use for these: there are situations where
some elements of running text have a gap of a specific width
between them, such as a figure space, which is better treated
like a character under authors or numeric formatting control
than something that gets automatically inserted during layout
and rendering.
Other spaces were found best modeled with a
minimal width, subject to expansion during layout if needed.

  
There is a wide range of typographical
quality in printed publication. The late '70s and '80s saw many
books published by direct photomechanical reproduction of
typescripts. These represent perhaps the bottom end of the
quality scale: they did not implement many fine typographical
details and their prevalence among technical literature may have
impeded the understanding of what character encoding support
would be needed for true fine typography. At the same time,
Donald Knuth was refining TeX to restore high quality digital
typography, initially for mathematics.
However, TeX did not have an underlying
character encoding; it was using a completely different model
 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. Tge early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)


Thank you for this insight. It is a still untold part of the history of Unicode.

It seems that there was little incentive to involve typographers because they 
have no computer science training, and because they were feared as trying to 
enforce requirements that Unicode was neither able nor willing to meet, such 
as distinct code points for italics, bold, small caps…

Among the grievances, Unicode is blamed for confusing Greek psili and dasia 
with comma shapes, and for misinterpreting Latin letter forms such as the u 
with descender taken for a turned h, and double u mistaken for a turned m, 
errors that subsequently misled font designers to apply misplaced serifs. 
Things were done in haste and in a hurry, under the Damocles sword of a hostile 
ISO meddling and menacing to unleash an unusable standard if Unicode wasn’t 
quicker.


If publishers had been involved, they would have revealed that they all needed 
various whitespaces for correct typography (i.e. layout). Typographers themselves 
did not care about whitespaces because they had no value for them (no glyph to 
sell).


Nevertheless the whole range of traditional space forms was admitted, even though 
they were going to be of limited usability. And they were given properties.
Or can’t the misdefinition of PUNCTUATION SPACE be traced back to that era?


Adobe's publishing software was then completely proprietary (just like Microsoft's and others like 
Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, 
dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained 
their typists or ad sellers to use it (that character was not "sold" in classified ads; 
it was necessary for correct layout, notably in narrow columns, and not using it confused the 
readers, notably for the ":" colon): it had to be non-breaking, non-expanding by justification, 
narrower than digits and even narrower than standard non-justified whitespace, and was consistently 
used as a decimal grouping separator.


No doubt they were confident that when a UCS was set up, such an important 
character wouldn’t be skipped.
So confident that they never guessed that they had a key role in reviewing, in 
providing feedback, in lobbying.
Too bad that we’re still so few people today, corporate vetters included, 
while many things are still going wrong.


But at that time the most common OSes did not support it natively because there 
was no vendor charset supporting it (and in fact most OSes were still unable to 
render proportional fonts everywhere and were frequently limited to 8-bit 
encodings: DOS, Windows, Unix(es), and even Linux at its early start).


Was there a lack of foresight?
Turns out that today, as those characters are needed, they aren’t ready. Not 
even the NNBSP.

Perhaps it’s the poetic ‘justice of time’ that since Unicode is on, the 
Vietnamese are the foremost, and the French the hindmost.
[I’m alluding to the early lobbying of Vietnam for a comprehensive set of 
precomposed letters, while French wasn’t even granted the benefit of 
the NNBSP – which according to PRI #308 [1] is today the only known use of 
NNBSP outside Mongolian – and a handful of ordinal indicators (possibly along 
with the rest of the alphabet, except q).]

[1] “The only other widely noted use for U+202F NNBSP is for representation of th

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 14:36, I wrote:

[…]
The only thing that searches have brought up


It was actually the best thing. Here’s an even more surprising hit:

   B. In the rules, allow these characters to bridge both 
alphabetic and numeric words, with:

 * Replace MidLetter by (MidLetter | MidNumLet)
 * Replace MidNum by (MidNum | MidNumLet)

   4. In addition, the following are also sometimes used, or could 
be used, as numeric separators (we don't give much guidance as to the best 
choice in the standard):

   0020 ( ) SPACE
   00A0 ( ) NO-BREAK SPACE
   2007 ( ) FIGURE SPACE
   2008 ( ) PUNCTUATION SPACE
   2009 ( ) THIN SPACE
   202F ( ) NARROW NO-BREAK SPACE

   If we had good reason to believe that if one of these only 
really occurred between digits in a single number, then we could add it. I 
don't have enough information to feel like a proposal for that is warranted, 
but others may. Short of that, we should at least document in the notes that 
some implementations may want to tailor MidNum to add some of these.


I fail to understand what kind of hack is going on. Why didn’t Unicode wish to sort out 
which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence 
and consistency, hence exit…
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.
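
To make this concrete, here is a minimal Python sketch (my own
illustration, not CLDR code) that formats a number the French way, with
U+202F as group separator and a decimal comma:

    # a minimal sketch: French-style number formatting with NNBSP groups
    def format_fr(value: float, decimals: int = 2) -> str:
        s = f"{value:,.{decimals}f}"  # e.g. '1,234,567.89'
        s = s.replace(",", "\u202f")  # U+202F as group separator
        return s.replace(".", ",")    # decimal comma

    print(format_fr(1234567.891, 3))  # -> '1 234 567,891' with NNBSP groups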

CLDR agreed to fix this for French in release 34. At the present survey for 
release 35, everything is questioned again, must be assessed, and may impact 
implementations, while all other locales using space are still impacted by bad 
display using NO-BREAK SPACE.

I know we have another public Mail List for that, but I feel it’s important to 
submit this to a larger community for consideration and eventually, for 
feedback.

Thanks.

Regards,

Marcel

P.S. For completeness:

http://unicode.org/L2/L2007/07370-punct.html

And also wrt my previous post:

https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt


Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS.


Then we should be able to read its encoding proposal in the UTC document 
registry, but Google Search seems unable to retrieve it, so there is a big risk 
that no such proposal exists, even though the registry goes back to 1990.

The only thing that searches have brought up to me is that the part of UAX #14 
that I’ve quoted in the parent thread has been added by a Unicode Technical 
Director not mentioned in the author field, and that he did it on request from 
two gentlemen whose first names only are cited. I’m sure their full names are 
Martin J. Dürst and Patrick Andries, but I may be wrong.

I apologize for the comment I’ve made in my e‑mail. Still it would be good to 
learn why the French use of NNBSP is sort of taken with a grain of salt, while 
all involved parties knew that this NNBSP was (as it still is) the only 
Unicode character ever encoded that is able to represent the so-long-asked-for 
“espace fine insécable.”

There is also another question I’ve been asking for a while: Why wasn’t the 
character U+2008 PUNCTUATION SPACE given the line break property value "GL" 
like its sibling U+2007 FIGURE SPACE?

This addition to UAX #14 is dated as early as “2007-08-08”. Why was the Core 
Specification not updated in sync, but only 7 years later? And was Unicode 
aware that this whitespace is hated by the industry to such an extent that a 
major vendor denied support in a major font at a major release of a major OS?

Or did they wait in vain that Martin and Patrick come knocking at their door to 
beg for font support?


Regards,

Marcel


The problem is that the initial rush for French was made in a period where 
Unicode and ISO were competing and not in sync, so no agreement could be found, 
until there was a decision to merge the efforts. The early rush was in ISO 
still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no 
desire to encode all the languages and scripts, focusing initially only on 
trying to unify the existing vendor character sets which were already 
implemented by a limited set of proprietary vendor implementations (notably 
IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
including the existing ISO 8859-*, GBK, and some national standards or de facto 
standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this time but still 
using another unrelated technology). Font standards did not yet exist and were 
competing in incompatible ways; all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

If publishers had been involved, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographers themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software was then completely proprietary (just like Microsoft's and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained their typists or ad sellers to use it (that character was not "sold" in classified ads; it was necessary for correct layout, notably in narrow columns, and not using it confused the readers, notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, 
and was consistently used as a decimal grouping separator.


But at that time the most common OSes did not support it natively because there was no 
vendor charset supporting it (and in fact most OSes were still unable to render 
proportional fonts everywhere and were frequently limited to 8-bit encodings: DOS, 
Windows, Unix(es), and even Linux at its early start). So an intermediate solution was 
needed. The US chose not to use the non-breakable thin space at all because in English it was 
not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for 
everything (but including its own national symbol for the "$", competing with 
other ISO 646 variants). There were tons of legacy applications developed over decades 
that did not support anything else, and interoperability in the US was available only with 
ASCII; everything else was unreliable.

If you remember the early years w

Re: NNBSP

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 09:58, Richard Wordingham wrote:


On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:


Also, at least one French typographer was extremely upset
about Unicode not gathering feedback from typographers.
That blame is partly wrong since at least one typographer
was and still is present in WG2, and even if not being a
Frenchman (but knowing French), as an Anglophone he might
have been aware of the most outstanding use case of NNBSP
with English (both British and American) quotation marks
when a nested quotation starts or ends a quotation, where
_‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
least with proportional fonts.


There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.


Thanks, I didn’t know that this is already implemented. Sometimes one can
read in discussions that the issue is dismissed to font level. That always
looked utopian to me, all the more as people bringing in former typewriting
expertise are trained to type spaces, and I always believed that it’s
a way for helpless keyboard layout designers to hand the job over.

Turns out there is more to it. But the high-end solution notwithstanding,
the use of an extra space character is recommended practice:

https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_
recommends a thin space, whereas _The Gregg Reference Manual_ promotes a
full space between the quotation marks. _The Chicago Manual of Style_ says
no space is necessary but adds that a space or a thin space can be inserted
as ‘a typographical nicety.’ ” The author cites three other manuals in which
she could not retrieve anything on the topic.

We note that all three style guides seem completely unconcerned with
non-breakability. Not so the author of the blog post: “[…] If your software
moves the double quotation mark to the next line of type, use a nonbreaking
space between the two marks to keep them together.” Certainly she would
recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard
or if the software provided a handy shortcut by default.
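
For the record, inserting such a space programmatically is trivial; a
minimal Python sketch (my own illustration; the choice of U+202F is the
assumption discussed above):

    import re

    # insert U+202F between abutting single/double quotation marks,
    # e.g. ...'" becomes ...'<NNBSP>"
    def space_nested_quotes(text: str) -> str:
        return re.sub("([\u2018\u2019\u201c\u201d])(?=[\u2018\u2019\u201c\u201d])",
                      "\\1\u202f", text)

    print(space_nested_quotes("He said, \u201cShe replied, \u2018No.\u2019\u201d"))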



This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)


Another drawback is that most environments don’t provide OpenType support,
and that the whole scheme depends on language tags that could easily get
lost, and that the issue, being particular to French, would quickly boil
down to dismissing support as not cost-effective, arguing that *if* some
individual locale has special requirements for punctuation layout, its
writers are welcome to pick an appropriate space from the UCS and key it
in as desired.

The same is also observed about Mongolian. Today, the preferred approach
for appending suffixes is to encode a Mongolian Suffix Connector to make
sure the renderer will use correct shaping, and to leave the space to the
writer’s discretion. That indeed looks much better than imposing a hard
space that proved cumbersome in practice, and that is
reported to often get in the way of a usable text layout.

The problems related to NNBSP as encountered in Mongolian are completely
absent when NNBSP is used with French punctuation or as the regular
group separator in numbers. Hence I’m sure that everybody on this List
agrees in discouraging changes made to the character properties of NNBSP,
such as switching the line breaking class (as "GL" is non-tailorable), or
changing general category to Cf, which could be detrimental to French.

However we need to admit that NNBSP is basically not a Latin but a
Mongolian space, despite being readily attracted into Western typography.
A similar disturbance takes place in word processors, where except in
Microsoft Word 2013, the NBSP is not justifying as intended and as it is
on the web. It’s being hacked and hijacked despite being a bad compromise,
for the purpose of French punctuation spacing. That tailoring is in turn
very detrimental to Polish users, among others, who need a justifying
no-break space for the purpose of prepending one-letter prepositions.

Fortunately a Polish user found and shared a workaround using the string
, the latter being still used in lieu of WORD JOINER as
long as Word keeps unsupporting latest TUS (an issue that raised concern
at Microsoft when it was reported, and will probably be fixed or has
already been fixed meanwhile).



Another spacing m

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Philippe Verdy via Unicode
On Thu, Jan 17, 2019 at 05:01, Marcel Schneider via Unicode
<unicode@unicode.org> wrote:

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode  wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode about this
> >> problem, blame the renderer (or the application or OS using it, may
> >> be they are very outdated and were not aware of these features, they
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present even if it was requested since very long at least for
> >> French and even English, before even Unicode, and long before
> >> Mongolian was then encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are completely script-neutral, they are generic, and are even
> >> BiDi-neutral, they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" was requested long long before
Mongolian was discussed for encodinc in the UCS. The problem is that the
initial rush for French was made in a period where Unicode and ISO were
competing and not in sync, so no agreement could be found, until there was
a decision to merge the efforts. Tge early rush was in ISO still not using
any character model but a glyph model, with little desire to support
multiple whitespaces; on the Unicode side, there was initially no desire to
encode all the languages and scripts, focusing initially only on trying to
unify the existing vendor character sets which were already implemented by
a limited set of proprietary vendor implementations (notably IBM,
Microsoft, HP, Digital) plus a few of the registered chrsets in IANA
including the existing ISO 8859-*, GBK, and some national standard or de
facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this
time but still using another unrelated technology). Font standards did not
yet exist and were competing in incompatible ways; all was a mess
at that time, so publishers were still required to use proprietary software
solutions, with very low interoperability (at that time the only "standard"
was PostScript, not needing any character encoding at all, but only
encoding glyphs!)

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Typographers
themselves did not care about whitespaces because they had no value for
them (no glyph to sell). Adobe's publishing software was then completely
proprietary (just like Microsoft's and others like Lotus, WordPerfect...).
Years ago I was working for the French press, and they absolutely required
us to manage the [FINE] for use in newspapers, classified ads, articles,
guides, phone books, dictionaries. It was even mandatory to enter these
[FINE] in the composed text, and they trained their typists or ad sellers
to use it (that character was not "sold" in classified ads; it was
necessary for correct layout, notably in narrow columns, and not using it
confused the readers, notably for the ":" colon): it had to be
non-breaking, non-expanding by justification, narrower than digits and even
narrower than standard non-justified whitespace, and was consistently used
as a decimal grouping separator.

But at that time the most common OSes did not support it natively because
there was no vendor charset supporting it (and in fact most OSes were still
unable to render proportional fonts everywhere and were frequently limited
to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breakable thin space at all because in English it was not needed for
basic Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (but including its own national symbol for the "$", competing
with other ISO 646 variants). There were tons of legacy applications
developed over decades that did not support anything else, and
interoperability in the US was available only with ASCII; everything else
was unreliable.

If you remember the early years when the Internet started to develop
outside US, you remember the nig

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

Courier New lacked NNBSP on Windows 7. It includes it on
Windows 10. The tests I referred to were made 2 years ago. I
confess that I was so disappointed to see Courier New unsupporting
NNBSP a decade after encoding, while many relevant people in the
industry were surely aware of its role and importance for French
(at least those keeping a branch office in France), that I gave it
up. Turns out that foundries delayed support until the usage
was backed by TUS, which happened in 2014, in time for Windows 10.
(I’m lacking hints about Windows 8 and 8.1.)

Superscripts are a handy parallel showcasing a similar process.
As long as preformatted superscripts are outlawed by TUS for use
in the digital representation of abbreviation indicators, vendors
keep disturbing their glyphs with what one could start calling an
intentional metrics disorder (IMD). One can also rank the vendors
on the basis of the intensity of IMD in preformatted superscripts,
but this is not the appropriate thread, and anyhow this List is
not the place. A comment on CLDR ticket #11653 is better.

[…]

Due to the way NNBSP made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New unsupporting NNBSP. […]

Marcel


Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:

> Also, at least one French typographer was extremely upset
> about Unicode not gathering feedback from typographers.
> That blame is partly wrong since at least one typographer
> was and still is present in WG2, and even if not being a
> Frenchman (but knowing French), as an Anglophone he might
> have been aware of the most outstanding use case of NNBSP
> with English (both British and American) quotation marks
> when a nested quotation starts or ends a quotation, where
> _‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
> unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
> least with proportional fonts.

There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.
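
A rough sketch of the idea (not the exact example from the OpenType
documentation; the font file, glyph name and adjustment values are
placeholders), using fontTools to compile such a rule into a font:

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    FEA = """
    languagesystem latn dflt;
    languagesystem latn FRA;

    feature kern {
        script latn;
        language FRA exclude_dflt;
        # GPOS single adjustment: shift the colon right and widen its
        # advance so French text gets a built-in gap before it
        position colon <100 0 200 0>;
    } kern;
    """

    font = TTFont("SomeFont.ttf")  # placeholder input font
    addOpenTypeFeaturesFromString(font, FEA)
    font.save("SomeFont-frpunct.ttf")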

This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)

Another spacing mess occurs with the Thai repetition mark U+0E46 THAI
CHARACTER MAIYAMOK, which is supposed to be separated from the
duplicated word by a space.  I'm not sure whether this space should
expand for justification any more often than inter-letter spacing. Some
fonts have taken to including the preceding space in the character's
glyph, which messes up interoperability.  An explicit space looks ugly
when the font includes the space in the repetition mark, and the lack of
an explicit space looks illiterate when the font excludes the leading
space.

Richard.



Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Marcel Schneider via Unicode

On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:


On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:


If your fonts behave incorrectly on your system because it does not
map any glyph for NNBSP, don't blame the font or Unicode about this
problem, blame the renderer (or the application or OS using it, may
be they are very outdated and were not aware of these features, they
are probably based on old versions of Unicode when NNBSP was still
not present even if it was requested since very long at least for
French and even English, before even Unicode, and long before
Mongolian was then encoded, only in Unicode and not in any known
supported legacy charset: Mongolian was specified by borrowing the
same NNBSP already designed for Latin, because the Mongolian space
had no known specific behavior: the encoded whitespaces in Unicode
are completely script-neutral, they are generic, and are even
BiDi-neutral, they are all usable with any script).


The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.


Indeed it was proposed as MONGOLIAN SPACE at block start, which was
consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
more. When Unicode argued in favor of a unification with , this was
pointed out as impracticable, and the need of a specific Mongolian space for
the purpose of appending suffixes was underscored. Only in London in
September 1998 was it agreed that “The Mongolian Space is retained but
moved to the general punctuation block and renamed ‘Narrow No Break Space’ ”.

However, unlike for the Mongolian Combination Symbols sequencing a question
and exclamation mark both ways, a concrete rationale as to how useful the
character could be in other scripts doesn’t seem to have been put on the
table when the move to General Punctuation was decided.



Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:


As a side-note: The relevant text of TUS doesn’t predate version 11 (2018).



1) It has a shaping effect on following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.


I don’t believe that these additions to TUS are in any way able to fix
the many issues with NNBSP in Mongolian causing so much headache and
ending up in a unanimous desire to replace NNBSP with a *new*
MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters,
e.g. “ ᠲᠠᠶᠢᠭᠠᠨ ”

https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf



Or is property (3) appropriate for French?


No it isn’t. It only introduces new flaws for a character that,
despite being encoded for Mongolian with specific handling intended,
was readily ripped off for use in French, Philippe Verdy reported,
to the extent that it is actually an encoding error in Mongolian
that brought the long-missing narrow non-breakable thin space into
the UCS, in the block where it really belongs, and where it would
have been encoded from the beginning if there had been no desire to
keep it proprietary.

That is the hidden (almost occult) fact from which stances like “The
NNBSP can be used to represent the narrow space occurring around
punctuation characters in French typography, which is called an
‘espace fine insécable.’ ” (TUS) and “When NARROW NO-BREAK SPACE
occurs in French text, it should be interpreted as an ‘espace fine
insécable’.” (UAX #14) are stemming. The underlying meaning
as I understand it now is like: “The non-breakable thin space is
usually a vendor-specific layout control in DTP applications; it’s
also available via a TeX command. However, if you are interested
in an interoperable representation, here’s a Unicode character you
can use instead.”

Due to the way NNBSP made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New unsupporting NNBSP. I’ll need to use
a font manager to output a complete list wrt NNBSP support.
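
For a quick check without a font manager, here is a small Python sketch
using fontTools (my own illustration; the font directory is an
assumption) that lists which fonts map U+202F in their cmap:

    import glob
    from fontTools.ttLib import TTFont

    # report which fonts carry a glyph mapping for U+202F NNBSP
    for path in sorted(glob.glob("C:/Windows/Fonts/*.ttf")):
        font = TTFont(path, lazy=True)
        try:
            has_nnbsp = 0x202F in font.getBestCmap()
        except Exception:  # fonts without a usable Unicode cmap
            has_nnbsp = False
        print("OK  " if has_nnbsp else "MISS", path)
        font.close()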

I’m utterly worried about the fate of the non-breaking thin
space in Unicode, and I wonder why the French and Canadian
French people present at the outset – either on the Unicode side or
on the JTC1/SC2/WG2 side – didn’t get this character encoded in
the initial rush. Did they really sell themselves and their
locales to DTP lobbyists? Or were they tricked?

Also, at least one

NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Richard Wordingham via Unicode
On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:

> If your fonts behave incorrectly on your system because it does not
> map any glyph for NNBSP, don't blame the font or Unicode about this
> problem, blame the renderer (or the application or OS using it, may
> be they are very outdated and were not aware of these features, they
> are probably based on old versions of Unicode when NNBSP was still
> not present even if it was requested since very long at least for
> French and even English, before even Unicode, and long before
> Mongolian was then encoded, only in Unicode and not in any known
> supported legacy charset: Mongolian was specified by borrowing the
> same NNBSP already designed for Latin, because the Mongolian space
> had no known specific behavior: the encoded whitespaces in Unicode
> are completely script-neutral, they are generic, and are even
> BiDi-neutral, they are all usable with any script).

The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.

Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:

1) It has a shaping effect on following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.

Or is property (3) appropriate for French?
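
As a quick empirical check (my own sketch, assuming PyICU is
available), one can list the untailored line-break opportunities around
U+202F; with line-break class GL, none should appear on either side of
it:

    import icu

    # line-break opportunities in a French fragment containing U+202F
    bi = icu.BreakIterator.createLineInstance(icu.Locale.getFrench())
    text = "il dit\u202f: oui"
    bi.setText(text)
    print(list(bi))  # boundary offsets; none fall next to the NNBSP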

Richard.


Re: NNBSP and Word Boundaries

2015-10-04 Thread Richard Wordingham
On Fri, 2 Oct 2015 09:25:01 +0200
Mark Davis ☕️ <m...@macchiato.com> wrote:

> We add:
> 
> WB13c Mongolian_Letter × NNBSP
> WB13d NNBSP × Mongolian_Letter
> 
> *If* we want to also change behavior on the other side of the NNBSP,
> whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
> additional rules (with the appropriate values for ..., like Numeric)
> 
> WB13c Mongolian_Letter NNBSP × (...)
> WB13d (...) × NNBSP Mongolian_Letter

I'll assume the last two are meant to be WB13e and WB13f.

We can achieve the effects down to the first WB13d simply by changing
NNBSP from XX to MidNumLet.  This would also provide a proper "espace
fine" for French use within numbers
( https://www.druide.com/enquetes/pour-des-espaces-ins%C3%A9cables-impeccables
) to separate groups of 3 digits.  This needs *no* extra rules.

Now for combined numbers and letters, we might consider adding the two
rules:

WB12a Numeric MidNumLet × AHLetter
WB12b Numeric × MidNumLet AHLetter

I think we should go the whole hog, and instead have

WB12c (Numeric|AHLetter) MidNumLetQ × (Numeric|AHLetter)
WB12d (Numeric|AHLetter) × MidNumLetQ (Numeric|AHLetter)

Perhaps there are good reasons against them - I'm not aware of any.  (I
don't think it is wrong to treat "no.2" as a single word.)  These rules
would make the abbreviated names of a good many Thai forms (e.g. คร.๒, a
marriage certificate) into a single word.

WB12c and WB12d overlap with WB6, WB7, WB11 and WB12, which could be
slightly simplified. 
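
A toy approximation of WB12c/d (my own sketch; a real implementation
would tailor the UAX #29 properties rather than use a regex) treats
NNBSP, apostrophe and full stop as MidNumLetQ-style bridges:

    import re

    # a word is an alphanumeric run bridged by MidNumLetQ-like separators
    WORD = re.compile(r"\w+(?:[.'\u2019\u202f]\w+)*")

    def words(text: str) -> list[str]:
        return WORD.findall(text)

    print(words("no.2 and 1\u202f234 kg"))
    # -> ['no.2', 'and', '1\u202f234', 'kg'] (NNBSP stays inside the number)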

Richard.



NNBSP and Word Boundaries

2015-10-01 Thread Richard Wordingham
The background document for PRI #308 (Property Change for NNBSP),
http://www.unicode.org/review/pri308/pri308-background.html , says,

"The only other widely noted use for U+202F NNBSP is for representation
of the thin non-breaking space (espace fine insécable) regularly seen
next to certain punctuation marks in French style typography. However,
the word segmentation change for U+202F should have no impact in that
context, as ExtendNumLet is explicitly for preventing breaks between
letters, but does not prevent the identification of word boundaries
next to punctuation marks."

Unfortunately, this isn't quite true.  In the text fragment
" dit[NNBSP]: ", there would be internal word-boundaries before 'd' and
before and after ':', but the word isolated would be the four characters
"dit[NNBSP]".  One solution would be to replace NNBSP by U+2009 THIN
SPACE, for with untailored line-breaking there would be no line break
between it and the 't' or colon, but there would be a word break
between the 't' and the thin space.
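
This is easy to reproduce; here is a small sketch, assuming PyICU is
available (results will track whatever UAX #29 version the installed
ICU implements):

    import icu

    # word boundaries around U+202F in a French fragment
    bi = icu.BreakIterator.createWordInstance(icu.Locale.getFrench())
    text = "il dit\u202f: oui"
    bi.setText(text)
    print(list(bi))  # boundary offsets show whether U+202F binds to "dit"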

The problem is that characters with property ExtendNumLet can be the
first or last character of a word as well as a character strictly
within a word.  In this respect, the property differs from characters
with the property MidNumLet.  The problem with using that property
instead is that such characters, such as FULL STOP, may be flanked by
letters or numbers within a word, but not both.  The problem then
arises with the Mongolian analogue of '4th' etc. - it is written digit,
NNBSP, letters, and is a single word.

Richard.