Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

On 1/19/2019 3:53 AM, James Kass via Unicode wrote:
Marcel Schneider wrote,

> When you ask to know the foundations and that knowledge is persistently refused,
> you end up believing that those foundations just can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere,
> where it actually belongs.

Why not think of it as a learning curve?  Early concepts and priorities were made from a lower position on that curve.  We can learn from the past and apply those lessons to the future, but a post-mortem seldom benefits the cadaver.



+1. Well put about the cadaver.


  
Minutiae about decisions made long ago probably exist, but may be presently poorly indexed/organized and difficult to search/access.  As the collection of encoding history becomes more sophisticated and the searching technology becomes more civilized, it may become easier to glean information from the archives.

(OT - A little humor, perhaps...

On the topic of Francophobia, it is true that some of us do not like dead generalissimos.  But most of us adore the French for reasons beyond Brigitte Bardot and bon-bons.  Cuisine, fries, dip, toast, curls, culture, kissing, and tarts, for instance.  Not to mention cognac and champagne!)
  
  
  

It is time for this discussion to be moved to a small group of people interested in hashing out actual proposals for submission. Is there anyone here who would like to collaborate with Marcel to find a solution for European number formatting that

(1) fully supports the typographic best practice
(2) identifies acceptable fallbacks
(3) is compatible with existing legacy practice, even if that does not conform to (1) or (2)
(4) includes necessary adjustments to CLDR

If nobody here is interested in working on that, discussing this further on this list will not serve a useful purpose, as nothing will change in Unicode without a well-formulated proposal that covers the four parameters laid out here.

A./
  
  



Re: NNBSP

2019-01-19 Thread James Kass via Unicode



Marcel Schneider wrote,

> When you ask to know the foundations and that knowledge is persistently refused,
> you end up believing that those foundations just can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere,
> where it actually belongs.

Why not think of it as a learning curve?  Early concepts and priorities 
were made from a lower position on that curve.  We can learn from the 
past and apply those lessons to the future, but a post-mortem seldom 
benefits the cadaver.


Minutiae about decisions made long ago probably exist, but may be 
presently poorly indexed/organized and difficult to search/access. As 
the collection of encoding history becomes more sophisticated and the 
searching technology becomes more civilized, it may become easier to 
glean information from the archives.


(OT - A little humor, perhaps...
On the topic of Francophobia, it is true that some of us do not like 
dead generalissimos.  But most of us adore the French for reasons beyond 
Brigitte Bardot and bon-bons.  Cuisine, fries, dip, toast, curls, 
culture, kissing, and tarts, for instance.  Not to mention cognac and 
champagne!)




Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 09:42, Asmus Freytag via Unicode wrote:

[…]

For one, many worthwhile additions / changes to Unicode depend on getting written up in 
proposal form and then championed by dedicated people willing to see through the process. 
Usually, Unicode has so many proposals to pick from that at each point there are more 
than can be immediately accommodated. There's no automatic response to even issues that 
are "known" to many people.

"Demands" don't mean a thing, formal proposals, presented and then refined 
based on feedback from the committee is what puts issues on the track of being resolved.


That is also what I suspected, that the French were not eager enough to get 
French supported, as opposed to the Vietnamese who lobbied long before the era 
of proposals and UTC meetings.

Please, where can we find the proposals for FIGURE SPACE to become non-breakable, and for PUNCTUATION SPACE to stay or become breakable?

(That is not a rhetorical question. The ideal answer is a URL.
Also, that is not about pre-Unicode documentation, but about the action that 
Unicode took in that era.)


[…]

Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, 
but I don't remember using proportional spacing, although I've seen it in the kinds of 
"typescript" books I mentioned. Some had that crude approximation of 
typesetting.


Thanks for reporting.


When Unicode came out, that was no longer the state of the art as TeX and laser 
printers weren't limited that way.

However, the character sets from which Unicode was assembled (or which it had 
to match, effectively) were designed earlier - during those times. And we 
inherited some things (that needed to be supported so round-trip mapping of 
data was possible) but that weren't as well documented in their particulars.

I'm sure we'll eventually deprecate some and clean up others, like the 
Mongolian encoding (which also included some stuff that was encoded with an 
understanding that turned out less solid in retrospect than we had thought at 
the time).

Something the UTC tries very hard to avoid, but nobody is perfect. It's best 
therefore to try not to ascribe non-technical motives to any action or inaction 
of the UTC. What outsiders see is rarely what actually went down,


That is because the meeting minutes would benefit from being more explicit.


and the real reasons for things tend to be much less interesting from an 
interpersonal  or intercultural perspective.


I don’t care about “interesting” reasons. I’d just appreciate knowing the truth.


So best avoid that kind of topic altogether and never use it as basis for 
unfounded recriminations.


When you ask to know the foundations and that knowledge is persistently refused, you end up believing that those foundations just can’t be told.

Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it actually belongs. I’d kindly request not to be considered a hypocrite who in reality keeps blaming the UTC.


A./





Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 01:21, Shawn Steele wrote:


>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a little narrow-minded.

>> I didn’t look into these data interchanges but I suspect they won’t use any thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/


Thanks for sharing. As it happens, what I like most is the first reason you provide:

 * “The most obvious reason is that there is a bug in the data and we had to 
make a change. (Believe it or not we make mistakes ;-))  In this case our users 
(and yours too) want culturally correct data, so we have to fix the bug even if 
it breaks existing applications.”


No comment :)


>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.


No problem.


>> What are all these expected to do while localized with scripts outside 
Windows code pages?

(We call those “unicode-only” locales FWIW)


Noted.


The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).



Like here:
https://blogs.msdn.microsoft.com/shawnste/2009/06/01/writing-fields-of-data-to-an-encoded-file/

You’re showcasing that despite “The moral here is ‘Use Unicode’ ” some people are still not using it. It gets even weirder when you state that code pages and Unicode are not 1:1, contradicting the Unicode design principle of round-trip compatibility.

The point of not using Unicode, and likewise of not using verbose formats, is limited hardware resources. Often new implementations are built on top of old machines and programs, for example in the energy and shipping industries. This poses a security threat, ending up in power outages and logistic breakdowns. That makes our democracies vulnerable. Hence maintaining obsolete systems does not pay off. We’re all better off recycling all the old hardware and investing in the latest technologies, implementing Unicode along the way.

What you are advocating in this thread seems like a non-starter.


However, the fact that an app may run very poorly in Cherokee or whatever 
doesn’t mean that there aren’t a bunch of French enterprises that depend on 
that app for their day-to-day business.


They’re ill-advised in doing so (see above).


In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.


To be “selected”, not developed and built. The job is already done. What are 
people waiting for?


However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.


They may not see it because they’re lacking appropriate training in cyber security. You seem to be backing that unresponsive behavior. I can’t see that you may be doing any good by doing so, and I’d strongly advise you to reach out to your customers, or raise the issue with your managers. We’re in a time where companies are still making huge profits, and it is unclear where all that money goes once paid out to shareholders. The money is there; you only need to market the security. That job would be a better use of your time than tampering with legacy apps.


Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.


Am I pestering CLDR…

Keeping CLDR in sync is just the right way to go.

Since we’re on it: Do you have any hints about why some powerful UTC members 
seem to hate NNBSP in French?
I’m mainly talking about French punctuation spacing here.


Which is why I suggested an “opt-in” alt form that apps wanting “civilized” behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).



Asmus Freytag’s proposal seems better:

“having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.”


Because if you let your customers “opt in” instead of urging them to update, 
some will never opt in, given they

Re: NNBSP

2019-01-19 Thread Asmus Freytag via Unicode

On 1/18/2019 11:34 PM, Marcel Schneider via Unicode wrote:

Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?

==> probably not in the early days. Y

  
Perhaps it was ignored from the beginning, as Philippe Verdy reports that UTC ignored later demands, getting users upset.

==> for reasons given in another post, I tend to not give much credit to these suggestions.

For one, many worthwhile additions / changes to Unicode depend on getting written up in proposal form and then championed by dedicated people willing to see through the process. Usually, Unicode has so many proposals to pick from that at each point there are more than can be immediately accommodated. There's no automatic response to even issues that are "known" to many people.

"Demands" don't mean a thing; formal proposals, presented and then refined based on feedback from the committee, are what put issues on the track of being resolved.

That leaves us with the question of why it did so, given your statement that it was not what I ended up suspecting.

Does "Y" stand for the peace symbol?

==> No, my thumb sometimes touches the touchpad and flicks the cursor while I type. I don't always see where some characters end up. Or, I start a sentence and the phone rings. Or any of a number of scenarios. Take your pick.

  
 
 
ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure, not in book print.

==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.

Did you also see typewriters with proportional advance width (and interchangeable type wheels)? That was the high end of the typewriter market. (I already mentioned these typewriters in a previous e‑mail.) Books typeset this way could use bold and (less easily) italic spans.

Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, but I don't remember using proportional spacing, although I've seen it in the kinds of "typescript" books I mentioned. Some had that crude approximation of typesetting.
When Unicode came out, that was no longer the state of the art, as TeX and laser printers weren't limited that way.

However, the character sets from which Unicode was assembled (or which it had to match, effectively) were designed earlier - during those times. And we inherited some things (that needed to be supported so round-trip mapping of data was possible) but that weren't as well documented in their particulars.

I'm sure we'll eventually deprecate some and clean up others, like the Mongolian encoding (which also included some stuff that was encoded with an understanding that turned out less solid in retrospect than we had thought at the time).

Something the UTC tries very hard to avoid, but nobody is perfect. It's best therefore to try not to ascribe non-technical motives to any action or inaction of the UTC. What outsiders see is rarely what actually went down, and the real reasons for things tend to be much less interesting from an interpersonal or intercultural perspective. So best avoid that kind of topic altogether and never use it as a basis for unfounded recriminations.

A./

  



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:

On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:

On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:


Marcel,

about your many detailed *technical* questions about the history of character 
properties, I am afraid I have no specific recollection.


Other List Members are welcome to join in, many of whom are aware of how things 
happened. My questions are meant to be rather simple. Summing up the premium 
ones:

 1. Why does UTC ignore the need for a non-breakable thin space?
 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?

A less important piece of information would be how extensively typewriters with proportional advance width were used to produce books ready for print.

Another question you do answer below:


French is not the only language that uses a space to group figures. In fact, I 
grew up with thousands separators being spaces, but in much of the existing 
publications or documents there was certainly a full (ordinary) space being 
used. Not surprisingly, because in those years documents were typewritten and 
even many books were simply reproduced from typescript.

When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.


That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout works the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically]).


==> At the time Unicode was first created (and definitely before that, during the time of 
non-universal character sets) many applications existed that used a "typewriter 
model" and worked by space fill rather than decimal-point tabulation.


If you are talking about applications, as opposed to typesetting tables for book printing, then I’d suggest that the fixed-width display of tables could be done much like today’s source code layout, where normal space is used for that purpose. In this use case, line wrap is typically turned off. That could make non-breakable spaces sort of pointless (but I’m aware of your point below), except if people are expected to re-use the data in other environments. In that case, best practice is to use NNBSP as thousands separator while displaying it like other monospace characters. That’s at least how today’s monospace fonts work (provided they’re used in environments actually supporting Unicode, which may not happen with applications running in a terminal).


From today's perspective that older model is inflexible and not the best 
approach, but it is impossible to say how long this legacy approach hung on in 
some places and how much data might exist that relied on certain long-standing 
behaviors of these space characters.


My position for some time has been that legacy apps should use legacy libraries. But I’ll come back to this when responding to Shawn Steele.


For a good solution, you always need to understand

(1) the requirement of your "index" case (French, in this case)


That’s okay.


(2) how it relates to similar requirements in (all!) other languages / scripts


That’s rather up to CLDR as I suggested, given it has the means to submit a 
point to all vetters. See again below (in the part that you’ve cut off without 
consideration).


(3) how it relates to actual legacy practice


That’s Shawn Steele’s point (see next reply).


(3a) what will suddenly no longer work if you change the properties on some 
character

(3b) what older data will no longer work if the effective behavior of newer 
applications changes


I’ll note right away that this requires awareness of actual use cases and/or delving into the OSes, which is far beyond what I can currently do, both in terms of time and of resources. The vetter’s role is to inform CLDR with correct data from their locale. CLDR is then welcome to sort things out and to get in touch with the industry, which CLDR TC is actually doing. But that has no impact on the data submitted at survey time. Changing votes to say “OK, let the group separator be NBSP as long as…” would be a lie.



In lists like that, you can get away with not using a narrow thousands 
separator, because the overall context of the list indicates which digits 
belong together and form a number. Having a narrow space may still look nicer, 
but complicates the space fill between the symbol and the digits.


It does not, provided that all numbers have thousands separators, even if 
filling with spaces. It looks nicer because it’s more legible.


Now for numbers in running text using an o

Re: NNBSP

2019-01-18 Thread Richard Wordingham via Unicode
On Fri, 18 Jan 2019 10:20:22 -0800
Asmus Freytag via Unicode  wrote:

> However, if there's a consensus interpretation of a given character
> then you can't just go in and change it, even if it would make that
> character work "better" for a given circumstance: you simply don't
> know (unless you research widely) how people have used that character
> in documents that work for them. Breaking those documents
> retroactively is not acceptable.

Unless the UCD contains a contrary definition only usable where the
character wouldn't normally be used, in which case it is fine to try
to kick the character's users in the teeth. I am referring to the
belief that ZWSP separated words, whereas the UCD only defined it as a
lay-out control.  That outlawed belief has recently been very helpful
to me in using (as opposed to testing) a nod-Lana spell-checker on
Firefox.

Richard.


Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 2:46 PM, Shawn Steele via Unicode wrote:

>> That should not impact all other users out there interested in a civilized layout.

I’m not sure that the choice of the word “civilized” adds value to the conversation.  We have pretty much zero feedback that the OS’s French formatting is “uncivilized” or that the NNBSP is required for correct support.

>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is for.

For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.

Normally I’m all for having the “best” data in CLDR, and there are many locales that have data with limited support for whatever reasons.  U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces.  Sure, that’s not really the best practice, particularly in modern computing, but I suspect you’ll still find it taught in CS classes with little regard to things like NNBSP.

Shawn,

having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.

A./
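
A minimal sketch of such deviation-tolerant parsing (an illustration only, not ICU's API; it accepts the space-like group separators discussed in this thread):

    # Hypothetical lenient number parsing: strip every plausible group
    # separator before converting. Standard Python only.
    GROUP_SEPARATORS = {
        "\u0020",  # SPACE
        "\u00A0",  # NO-BREAK SPACE
        "\u2007",  # FIGURE SPACE
        "\u2008",  # PUNCTUATION SPACE
        "\u2009",  # THIN SPACE
        "\u202F",  # NARROW NO-BREAK SPACE
    }

    def parse_grouped_int(text: str) -> int:
        return int("".join(ch for ch in text if ch not in GROUP_SEPARATORS))

    assert parse_grouped_int("1\u202F234\u202F567") == 1234567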


  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode
ose were the correct set (just like we took punctuation from ASCII and similar sources and only added to it later, when we understood that they were missing things --- generally always added, generally did not redefine behavior or shape of existing code points).

Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?

==> probably not in the early days. Y

ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure, not in book print.

==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc.


  
If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time.

That is how CLDR works.

CLDR data is by definition per-language. Except for inheritance, languages are independent.

There are no "French" characters. When you encode characters, at best, some code points may be script-specific. For punctuation and spaces not even that may be the case. Therefore, as long as you try to solve this as if it only was a French problem, you are not doing proper character encoding.




  
Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differing acceptability of fallback renderings...)

No I don’t, but people may wish to read German Wikipedia:

https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen

Shared in ticket #11423:
https://unicode.org/cldr/trac/ticket/11423#comment:15

==> for your proposal to be effective, you need to reach out.

 
  
(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts can you hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.

There is no such problem, except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why, to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used for outside Mongolian: a non-breakable thin space, kind of a belated avatar of what PUNCTUATION SPACE should have been since the beginning.

==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, it will not change something that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it.)

If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible.


  
Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.

RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed 
>> for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a little narrow-minded.

>> I didn’t look into these data interchanges but I suspect they won’t use any
>> thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside 
>> Windows code pages?

(We call those “unicode-only” locales FWIW)

The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).  However, the fact that an app 
may run very poorly in Cherokee or whatever doesn’t mean that there aren’t a 
bunch of French enterprises that depend on that app for their day-to-day 
business.

In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.

However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.

Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.

Which is why I suggested an “opt-in” alt form that apps wanting “civilized” behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).

The data for locales like French tends to have been very stable for decades.  
Changes to data for major locales like that are more disruptive than to newer 
emerging markets where the data is undergoing more churn.

-Shawn



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 23:46, Shawn Steele wrote:


>> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.

I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).


If they are obsolete apps, they don’t use CLDR / ICU, as these are designed for 
up-to-date and fully localized apps. So one hassle is off the table.


Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.


I didn’t look into these data interchanges but I suspect they won’t use any thousands separator at all to interchange data. The group separator is only for display and print, and there you may wish to use a compat library for obsolete apps, and the newest library for apps with Unicode support. If an app is that obsolete, it will keep working without new data from ICU.


This could be something like a “new” app running with the latest & greatest locale data trying to import the legacy data users had saved in that app.  Or exchanging data with an application using the system settings, which are perhaps older.


Again, I don’t believe that apps are storing numbers with thousands separators in them. Not even spreadsheet software does that. I say “not even” because these are high-end apps where the latest locale data is expected.

Sorry, you did skip this one:

>> What are all these expected to do while localized with scripts outside Windows code pages?

Indeed, that is the paradox: Tirhuta users are entitled to correct display with the newest data, while Latin users are bothered indefinitely with old data and legacy display.


>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.


Not the user. I’m addressing your concerns as coming from the developer side. I 
meant you should use the data as appropriate, and if a character is beyond 
support, just replace it for convenience.


>> That should not impact all other users out there interested in a civilized 
layout.

I’m not sure that the choice of the word “civilized” adds value to the 
conversation.


That is to express in a mouthful of English what user feedback is or can be, even if not all the time. Users complain about quotation marks spaced off too far when typeset with NBSP, as Word does. It’s really ugly, they say. NBSP is a character with a precise usage; it’s not one-size-fits-all. BTW, as you are in the job: why does Word not provide an option with a checkbox letting the user set the space as desired, NBSP or NNBSP?


  We have pretty much zero feedback that the OS’s French formatting is 
“uncivilized” or that the NNBSP is required for correct support.


That is, at some point users stop submitting feedback when they see how little use it is to spend time posting it. From the “pretty much zero” you may wish to pick the one or two you do get, guessing that for each one there are a thousand other users out there with the same feedback who don’t submit it. One thousand or one million, it’s hard to be precise…


>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
for.

For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.


You don’t need that field in CLDR. Here’s how it works: take the locale data, search-and-replace all NNBSP with NBSP, and here’s the library you’ll use, because NNBSP is not only in the group separator. I’d suggest downloading common/main/fr.xml and checking all instances of NNBSP. The legacy apps you’re referring to don’t use that data for sure. That data is for fine high-end apps and for the user interfaces of Windows and any other OS. If you want your employer to be well served, you’d rather prefer the correct data, not legacy fallbacks.
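
A minimal sketch of such a downgrade tool (an assumption of what “run a tool on the data” could look like, not an existing CLDR utility):

    # Replace every NARROW NO-BREAK SPACE (U+202F) with NO-BREAK SPACE
    # (U+00A0) in a CLDR locale file such as common/main/fr.xml.
    from pathlib import Path

    def downgrade_cldr_file(path: str) -> None:
        p = Path(path)
        text = p.read_text(encoding="utf-8")
        p.write_text(text.replace("\u202f", "\u00a0"), encoding="utf-8")

    # downgrade_cldr_file("common/main/fr.xml")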


Normally I’m all for having the “best” data in CLDR, and there are many locales 
that have data with limited support for whatever reasons.  U+00A0 is pretty 
exceptional in my view though, developers have been hard-coding dependencies on 
that value for ½ a century without even realizing there might be other types of 
non-breaking spaces.  Sure, that’s not real

RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> Keeping these applications outdated has no other benefit than providing a 
>> handy lobbying tool against support of NNBSP.
I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).
Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.
This could be something like a “new” app running with the latest & greatest locale data trying to import the legacy data users had saved in that app.  Or exchanging data with an application using the system settings, which are perhaps older.
>> Also when you need those apps, just tailor your French accordingly.
Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.
>> That should not impact all other users out there interested in a civilized 
>> layout.
I’m not sure that the choice of the word “civilized” adds value to the 
conversation.  We have pretty much zero feedback that the OS’s French 
formatting is “uncivilized” or that the NNBSP is required for correct support.
>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
>> for.
For compatibility, I’d actually much prefer that CLDR have an alt “best practice” field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the “best practice” alternative data.  As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a “legacy” alt form in a few years.
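
A sketch of that opt-in model (names and structure are mine for illustration, not CLDR's actual schema):

    # French group separator: legacy U+00A0 stays the default; apps that
    # want the newer typographic data must explicitly opt in.
    SEPARATORS_FR = {"legacy": "\u00a0", "best_practice": "\u202f"}

    def group_separator_fr(opt_in_best_practice: bool = False) -> str:
        return SEPARATORS_FR["best_practice" if opt_in_best_practice else "legacy"]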
Normally I’m all for having the “best” data in CLDR, and there are many locales that have data with limited support for whatever reasons.  U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces.  Sure, that’s not really the best practice, particularly in modern computing, but I suspect you’ll still find it taught in CS classes with little regard to things like NNBSP.
-Shawn



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 22:03, Shawn Steele via Unicode wrote:


I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out that part of the reason NBSP has been used for thousands separators is that it exists in all of those legacy code pages that were mentioned, predating Unicode.

Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy code pages.  NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).

13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:48    58?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on them), and many legacy applications that don’t have proper UTF-8 or Unicode support (I know, they should be updated).  It doesn’t seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.


Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP. What are all these expected to do while localized with scripts outside Windows code pages?

Also, when you need those apps, just tailor your French accordingly. That should not impact all other users out there interested in a civilized layout, which we cannot get with NBSP, as it is justifying and numbers are torn apart in justified layout, nor with FIGURE SPACE as recommended in UAX #14, because it’s too wide and has no other benefit. BTW, FIGURE SPACE yields the same question mark in the Windows terminal, I guess, based on the above.

As long as SegoeUI has NNBSP support, no worries; that’s what CLDR data is for. Any legacy program can always use downgraded data; you can even replace NBSP if the expected output is plain ASCII. Downgrading is straightforward; the reverse is not true, which is why vetters are working so hard during CLDR surveys. CLDR data is kind of high-end; that is the only useful goal. Again, downgrading is easy: just run a tool on the data and the job is done. You’ll end up with two libraries instead of one, but at least you’re able to provide a good UX in environments supporting any UTF.

Best,

Marcel



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out that part of the reason NBSP has been used for thousands separators is that it exists in all of those legacy code pages that were mentioned, predating Unicode.

Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy code pages.  NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).
13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:48    58?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on them), and many legacy applications that don’t have proper UTF-8 or Unicode support (I know, they should be updated).  It doesn’t seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.
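
The question marks above can be reproduced in one line (a sketch; cp850 stands in for whichever OEM code page the console happens to use):

    # U+202F NARROW NO-BREAK SPACE has no mapping in legacy code pages such
    # as cp850 or cp1252, so lossy conversion substitutes "?".
    print("15\u202f360".encode("cp850", errors="replace").decode("cp850"))  # 15?360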

-Shawn

 
http://blogs.msdn.com/shawnste



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)

These collections, which often covered international standard character sets as 
well, were some of the prime inputs into the early drafts of Unicode. With the 
merger with ISO 10646 some characters from that effort, but not in the early 
Unicode drafts, were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note that, prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.

Unicode's character property model changed all that - but that meant that 
properties for all of the characters had to be determined long after they were 
first encoded in the original sources, and with only scant hints of the 
identity of what these were intended to be. (Often, the only hint was a 
character name and a rather poor bitmapped image).

If you want to know the "legacy" behavior for these characters, it is more useful, 
therefore, to see how they have been supported in existing software, and how they have been used in 
documents since then. That gives you a baseline for understanding whether any change or 
clarification of the properties of one of these code points will break "existing 
practice".

Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de-facto useless, because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.

However, if there's a consensus interpretation of a given character then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.


That is however what PRI #308 proposed to do: change the Gc of NNBSP from Zs to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the *MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break, for example, those implementations relying on Gc=Zs for the purpose of applying a background color to all (otherwise invisible) space characters.
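
An implementation of the kind meant here might look as follows (a sketch; unicodedata ships with Python, and U+202F currently reports General_Category Zs):

    import unicodedata

    def highlight_spaces(text: str) -> str:
        # Mark every character whose General_Category is Zs (space separator).
        return "".join(
            f"[{ord(ch):04X}]" if unicodedata.category(ch) == "Zs" else ch
            for ch in text
        )

    print(highlight_spaces("1\u202f234"))  # 1[202F]234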

On the occasion of that Public Review Issue, J. S. Choi reported another use case of NNBSP, between an integer and a vulgar fraction, pointing out an error in TUS version 8.0 by the way: “the THIN SPACE does not prevent line breaking from occurring, which is required in style guides such as the Chicago Manual of Style”. ― In version 11.0 the erroneous part is still uncorrected: “If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.”  Note that TUS has typeset this with the precomposed U+00BE, not with plain digits and fraction slash.
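
Spelled out as a code point sequence (a sketch; rendering the result as 1¾ depends on font support for fraction formation):

    # 1 + U+2009 THIN SPACE + 3 + U+2044 FRACTION SLASH + 4.
    # Choi's point: U+2009 allows a line break here, while U+202F would not.
    mixed_number = "1\u20093\u20444"
    print(mixed_number)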

If U+2008 PUNCTUATION SPACE is used as intended, changing its line break property from A to GL does not break any implementation or document. As for possible misuse of the character in ways other than intended, generally there is no point in using as br

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:02, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

I understand only better why a significant majority of UTC is hating French.

Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethical and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I’d mention that by the time when Unicode was developed, there was a global hatred against France that originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking the Rainbow Warrior and killing the crew’s photographer in the port of Auckland. That crime triggered a peak of anger.


Again, my recollections do *not support* any issues of _Francophobia_.

The Unicode Technical Committee has always had French people on board, from the beginning, and I have witnessed no issues where they took up a different technical position based on language. Quite the opposite: the UTC generally appreciates when someone can provide native insights into the requirements for supporting a given language. How best to realize these requirements then becomes a joint effort.

If anything, the Unicode Consortium saw itself from the beginning in contrast 
to an IT culture for which internationalization at times was still something of 
an afterthought.

Given all that, I find your suggestions and implications deeply hurtful and hope you will find a way to avoid a repetition in the future.

May I suggest that trying to rake over the past and apportion blame is generally less productive than _moving forward_ and addressing the outstanding problems.


It is my last-resort track, one that I’m deeply convinced of. But I’m thankfully relieved of needing to discuss it here further.

To point out a well-founded behavior is not to blame. You’ll note that I carefully established how UTC would have been right in doing so, if they did. I wasn’t aware that I was being hurtful. You tell me, so I apologize. Please note, though, based on my past e‑mail, that I see UTC as a compound of multiple, sometimes antagonistic tendencies. Just an example to help understand what I mean: when Karl Pentzlin proposed to encode a missing French abbreviation indicator, a typographer was directed to argue (on behalf of his employer, IIUC) that this would be a case of encoding all scripts in bold and italic. The OP protested that it wasn’t, but he was not heard. That example raises much concern, the more so as we were told on this List that decision makers in UTC are refusing to join in open and public discussions here, and are only “duelling ballot comments.”

Now, since regardless of being right in doing so, they did not at all, I’m plunged again into disarray. May I quote Germaine Tillion, a French ethnologist: It’s important to understand what happens to us; to understand is to exist. ― Originally, “to exist” meant “to stand out.” That is still somewhat implied in the strong sense of “to exist.” Understanding does also help to overcome. That’s why I wrote one e‑mail before:

Nothing happens, or does not happen, without a good reason.
Finding out that reason is key to recovery.
If we want to get what we need, we must do our homework first.

Thanks for helping bring it to the point.

Kind regards,

Marcel


Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

Marcel,

about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.
French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript.
When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up.
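
A sketch of that "lining" use (an illustration; it assumes a font in which U+2007 FIGURE SPACE matches digit width):

    # Pad ragged amounts with U+2007 so the digits line up under a fixed
    # leading currency symbol.
    amounts = ["1", "25", "1300"]
    width = max(len(a) for a in amounts)
    for a in amounts:
        print("$" + "\u2007" * (width - len(a)) + a)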
In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.

Now for numbers in running text, using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines.

The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention.)
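
A sketch of the running-text case (an illustration, not ICU; U+202F keeps the groups from being broken across lines where U+0020 would not):

    # Format an integer with NARROW NO-BREAK SPACE as the group separator.
    def format_grouped(n: int, sep: str = "\u202f") -> str:
        return f"{n:,}".replace(",", sep)

    print(format_grouped(1234567))  # "1 234 567" with U+202F between groups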
If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differing acceptability of fallback renderings...)
(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly than it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts can you hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography.

Perhaps you see why this issue has languished for so long: getting it right is not a simple matter.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:


  

  
Covering existing character sets (National, International and Industry) was an (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).

I’d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: “U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.”

Is that correct?
May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)

These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646 some characters from that effort, but not in the early Unicode drafts, were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note that, prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.
Unicode's character property model changed all that - but that
  meant that properties for all of the characters had to be
  determined long after they were first encoded in the original
  sources, and with only scant hints of the identity of what these
  were intended to be. (Often, the only hint was a character name
  and a rather poor bitmapped image).
If you want to know the "legacy" behavior for these characters,
  it is more useful, therefore, to see how they have been supported
  in existing software, and how they have been used in documents
  since then. That gives you a baseline for understanding whether
  any change or clarification of the properties of one of these code
  points will break "existing practice".
Breaking existing practice should be a dealbreaker, no matter how
  well-intentioned a change is. The only exception is where existing
  implementations are de facto useless, because of glaring
  inconsistencies or other issues. In such exceptional cases,
  deprecating some interpretations of a character may be a net win.
However, if there's a consensus interpretation of a given
  character, then you can't just go in and change it, even if it would
  make that character work "better" for a given circumstance: you
  simply don't know (unless you research widely) how people have
  used that character in documents that work for them. Breaking
  those documents retroactively is not acceptable.
A./

  



Re: NNBSP

2019-01-18 Thread Asmus Freytag via Unicode

  
  
On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

> I understand only better why a significant majority of the UTC hates
> French.
>
> Francophobia is also palpable in Canada, beyond any technical reasons,
> especially in the IT industry. Hence the position of the UTC is far
> from isolated. If ethical and personal considerations inflect
> decision-making, they should consistently be an integral part of
> discussions here. In that vein, I’d mention that by the time Unicode
> was developed, there was a global hatred against France, which
> originated in French colonial and foreign politics since WWII, and was
> revived a few years ago by the French government sinking the _Rainbow
> Warrior_ and killing the crew’s photographer in the port of Auckland.
> That crime triggered a peak of anger.

Again, my recollections do not support any issues of Francophobia.
The Unicode Technical Committee has always
had French people on board, from the beginning, and I have
witnessed no issues where they took up a different technical
position based on language. Quite the opposite: the UTC
generally appreciates when someone can provide native insights
into the requirements for supporting a given language. How best
to realize these requirements then becomes a joint effort.
  
If anything, the Unicode Consortium saw itself from the beginning
  in contrast to an IT culture for which internationalization at
  times was still something of an afterthought.
Given all that, I find your suggestions and implications deeply
  hurtful and hope you will find a way to avoid a repetition in the
  future.
May I suggest that trying to rake over the past and apportion
  blame is generally less productive than moving forward and
  addressing the outstanding problems.
A./





  



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 17/01/2019 20:11, 梁海 Liang Hai via Unicode wrote:

[Just a quick note to everyone that, I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]


Welcome to Unicode Public.

Hopefully this discussion helps sort things out so that we’ll know both what to 
do wrt Mongolian and what to do wrt French.

On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

 [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. Tge early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

Thank you for this insight. It is a still untold part of the history of Unicode.

This historical summary does *not* square in key points with my own 
recollection (I was there). I would therefore not rely on it as gospel truth.

In particular, one of the key technologies that _brought industry partners to 
cooperate around Unicode_ was font technology, in particular the development of 
the /TrueType/ standard. I find it not credible that no typographers were part 
of that project :).


It is probably part of the (unintentional) false blame spread by the cited 
author’s paper. My apologies for not sufficiently assessing the reliability of 
my sources. I’d already identified a number of errors but wasn’t savvy enough 
to see the other one reported by Richard Wordingham. Now the paper ends up 
as a mere libel. It doesn’t mention the lack of NNBSP; instead it piles up a 
bunch of gratuitous calumnies. Should that be the prevailing mood of average 
French professionals with respect to Unicode ― indeed Patrick Andries is the 
only French tech writer on Unicode I found whose work is acclaimed, the others 
are either disliked or silent (or libellers) ― then I understand only better 
why a significant majority of the UTC hates French.

Francophobia is also palpable in Canada, beyond any technical reasons, 
especially in the IT industry. Hence the position of the UTC is far from 
isolated. If ethical and personal considerations inflect decision-making, they 
should consistently be an integral part of discussions here. In that vein, I’d 
mention that by the time Unicode was developed, there was a global hatred against 
France, which originated in French colonial and foreign politics since WWII, and 
was revived a few years ago by the French government sinking the _Rainbow Warrior_ 
and killing the crew’s photographer in the port of Auckland. That crime 
triggered a peak of anger.


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


The statement: "there was initially no desire to encode all the languages and 
scripts" is categorically false.


Though Unicode was designed as limited to 65 000 characters, and it was 
stated that historic scripts were out of scope, only livi

Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 18:35:49 +0100
Marcel Schneider via Unicode  wrote:


> Among the grievances, Unicode is blamed for confusing Greek psili and
> dasia with comma shapes, and for misinterpreting Latin letter forms
> such as the u with descender taken for a turned h, and double u
> mistaken for a turned m, errors that subsequently misled font
> designers to apply misplaced serifs.

And I suppose that the influence was so great that it travelled back in
time to 1976, affecting the typography of the Pelican book 'Phonetics'
as reprinted in 1976.

Those IPA characters originated in a tradition where new characters had
been derived by rotating other characters so as to avoid having to have
new type cut.  Misplaced serifs appear to be original.

Richard.



Re: NNBSP

2019-01-17 Thread 梁海 Liang Hai via Unicode
[Just a quick note to everyone that, I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]

Best,
梁海 Liang Hai
https://lianghai.github.io

> On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode  
> wrote:
> 
> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>>> [quoted mail]
>>> 
>>> But the French "espace fine insécable" was requested long long before 
>>> Mongolian was discussed for encoding in the UCS. The problem is that the 
>>> initial rush for French was made in a period where Unicode and ISO were 
>>> competing and not in sync, so no agreement could be found, until there was 
>>> a decision to merge the efforts. The early rush was in ISO still not using 
>>> any character model but a glyph model, with little desire to support 
>>> multiple whitespaces; on the Unicode side, there was initially no desire to 
>>> encode all the languages and scripts, focusing initially only on trying to 
>>> unify the existing vendor character sets which were already implemented by 
>>> a limited set of proprietary vendor implementations (notably IBM, 
>>> Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
>>> including the existing ISO 8859-*, GBK, and some national standards or de 
>>> facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well, there was Adobe at this 
>>> time but still using another unrelated technology). Font standards did not 
>>> yet exist and were competing in incompatible ways; all was a mess 
>>> at that time, so publishers were still required to use proprietary software 
>>> solutions, with very low interoperability (at that time the only "standard" 
>>> was PostScript, not needing any character encoding at all, but only 
>>> encoding glyphs!)
>> 
>> Thank you for this insight. It is a still untold part of the history of 
>> Unicode.
> This historical summary does not square in key points with my own 
> recollection (I was there). I would therefore not rely on it as if gospel 
> truth.
> 
> In particular, one of the key technologies that brought industry partners to 
> cooperate around Unicode was font technology, in particular the development 
> of the TrueType Standard. I find it not credible that no typographers were 
> part of that project :).
> 
> Covering existing character sets (National, International and Industry) was 
> an (not "the") important goal at the time: such coverage was understood as a 
> necessary (although not sufficient) condition that would enable data 
> migration to Unicode as well as enable Unicode-based systems to process and 
> display non-Unicode data (by conversion). 
> 
> The statement: "there was initially no desire to encode all the languages and 
> scripts" is categorically false.
> 
> (Incidentally, Unicode does not "encode languages" - no character encoding 
> does).
> 
> What has some resemblance of truth is that the understanding of how best to 
> encode whitespace evolved over time. For a long time, there was a confusion 
> whether spaces of different width were simply digital representations of 
> various metal blanks used in hot metal typography to lay out text. As the 
> placement of these was largely handled by the typesetter, not the author, it 
> was felt that they would be better modeled by variable spacing applied 
> mechanically during layout, such as applying indents or justification.
> 
> Gradually it became better understood that there was a second use for these: 
> there are situations where some elements of running text have a gap of a 
> specific width between them, such as a figure space, which is better treated 
> like a character under authors or numeric formatting control than something 
> that gets automatically inserted during layout and rendering.
> 
> Other spaces were found best modeled with a minimal width, subject to 
> expansion during layout if needed.
> 
> 
> 
> There is a wide range of typographical quality in printed publication. The 
> late '70s and '80s saw many books published by direct photomechanical 
> reproduction of typescripts. These represent perhaps the bottom end of the 
> quality scale: they did not implement many fine typographical details and 
> their prevalence among technical literature may have impeded the 
> understanding of what character encoding support would be needed for true 
> fine typography. At the same time, Donald Knuth was refining TeX to restore 
> high quality digital typography, initially for mathematics.
> 
> However, TeX did not have an underlying character encoding; it was using a 
> completely different model mediating between source data and final output. 
> (And it did not know anything about typography for other writing systems).
> 
> Therefore, it is not surprising that it took a while and a few false starts 
> to get the encoding model correct for space characters.
> 
> Hopefully, well 

Re: NNBSP

2019-01-17 Thread Asmus Freytag via Unicode

  
  
On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

[quoted mail]
But the French "espace fine insécable" was requested
  long long before Mongolian was discussed for encodinc in
  the UCS. The problem is that the initial rush for French
  was made in a period where Unicode and ISO were competing
  and not in sync, so no agreement could be found, until
  there was a decision to merge the efforts. Tge early rush
  was in ISO still not using any character model but a glyph
  model, with little desire to support multiple whitespaces;
  on the Unicode side, there was initially no desire to
  encode all the languages and scripts, focusing initially
  only on trying to unify the existing vendor character sets
  which were already implemented by a limited set of
  proprietary vendor implementations (notably IBM,
  Microsoft, HP, Digital) plus a few of the registered
  chrsets in IANA including the existing ISO 8859-*, GBK,
  and some national standard or de facto standards (Russia,
  Thailand, Japan, Korea).
This early rush did not involve typographers (well
  there was Adobe at this time but still using another
  unrelated technology). Font standards were still not
  existing and were competing in incompatible ways, all was
  a mess at that time, so publishers were still required to
  use proprietary software solutions, with very low
  interoperability (at that time the only "standard" was
  PostScript, not needing any character encoding at all, but
  only encoding glyphs!)
  

  
  
  Thank you for this insight. It is a still untold part of the
  history of Unicode.
This historical summary does not square
in key points with my own recollection (I was there). I would
therefore not rely on it as gospel truth.
  
In particular, one of the key technologies
that brought industry partners to cooperate around Unicode
was font technology, in particular the development of the TrueType
Standard. I find it not credible that no typographers were
part of that project :).
Covering existing character sets (National,
International and Industry) was an (not "the") important
goal at the time: such coverage was understood as a necessary
(although not sufficient) condition that would enable data
migration to Unicode as well as enable Unicode-based systems to
process and display non-Unicode data (by conversion). 
  
The statement: "there was initially no
desire to encode all the languages and scripts" is categorically
false.
(Incidentally, Unicode does not "encode
languages" - no character encoding does).
What has some resemblance of truth is that
the understanding of how best to encode whitespace evolved over
time. For a long time, there was a confusion whether spaces of
different width were simply digital representations of various
metal blanks used in hot metal typography to lay out text. As
the placement of these was largely handled by the typesetter,
not the author, it was felt that they would be better modeled by
variable spacing applied mechanically during layout, such as
applying indents or justification.
  
Gradually it became better understood that
there was a second use for these: there are situations where
some elements of running text have a gap of a specific width
between them, such as a figure space, which is better treated
like a character under authors or numeric formatting control
than something that gets automatically inserted during layout
and rendering.
Other spaces were found best modeled with a
minimal width, subject to expansion during layout if needed.

  
There is a wide range of typographical
quality in printed publication. The late '70s and '80s saw many
books published by direct photomechanical reproduction of
typescripts. These represent perhaps the bottom end of the
quality scale: they did not implement many fine typographical
details and their prevalence among technical literature may have
impeded the understanding of what character encoding support
would be needed for true fine typography. At the same time,
Donald Knuth was refining TeX to restore high quality digital
typography, initially for mathematics.
However, TeX did not have an underlying
character encoding; it was using a completely different model
 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. Tge early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)


Thank you for this insight. It is a still untold part of the history of Unicode.

It seems that there was little incentive to involve typographers because they 
have no computer science training, and because they were feared as trying to 
enforce requirements that Unicode was neither able nor willing to meet, such 
as distinct code points for italics, bold, small caps…

Among the grievances, Unicode is blamed for confusing Greek psili and dasia 
with comma shapes, and for misinterpreting Latin letter forms such as the u 
with descender taken for a turned h, and double u mistaken for a turned m, 
errors that subsequently misled font designers to apply misplaced serifs. 
Things were done in haste and in a hurry, under the Damocles sword of a hostile 
ISO meddling and menacing to unleash an unusable standard if Unicode wasn’t 
quicker.


If publishers had been involved, they would have revealed that they all needed 
various whitespaces for correct typography (i.e. layout). Typographers themselves 
did not care about whitespaces because they had no value for them (no glyph to 
sell).


Nevertheless the whole range of traditional space forms was admitted, even though 
they were going to be of limited usability. And they were given properties.
Or can’t the misdefinition of PUNCTUATION SPACE be traced back to that era?


Adobe's publishing software was then completely proprietary (just like Microsoft's and others like 
Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, 
dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained 
their typists or ad sellers to use it (that character was not "sold" in classified ads; 
it was necessary for correct layout, notably in narrow columns, and not using it confused the 
readers, notably for the ":" colon): it had to be non-breaking, non-expanding by justification, 
narrower than digits and even narrower than standard non-justified whitespace, and was consistently 
used as a decimal grouping separator.


No doubt they were confident that when a UCS was set up, such an important 
character wouldn’t be skipped.
So confident that they never guessed that they had a key role in reviewing, in 
providing feedback, in lobbying.
Too bad that we’re still so few people today, corporate vetters included, 
while many things are still going wrong.


But at that time the most common OSes did not support it natively because there 
was no vendor charset supporting it (and in fact most OSes were still unable to 
render proportional fonts everywhere and were frequently limited to 8-bit 
encodings: DOS, Windows, Unix(es), and even Linux at its early start).


Was there a lack of foresight?
Turns out that today, as those characters are needed, they aren’t ready. Not 
even the NNBSP.

Perhaps it’s the poetic ‘justice of time’ that since Unicode is on, the 
Vietnamese are the foremost, and the French the hindmost.
[I’m alluding to the early lobbying of Vietnam for a comprehensive set of 
precomposed letters, while French wasn’t even granted the benefit of 
the NNBSP – which according to PRI #308 [1] is today the only known use of 
NNBSP outside Mongolian – and a handful of ordinal indicators (possibly along 
with the rest of the alphabet, except q).]

[1] “The only other widely noted use for U+202F NNBSP is for representation of th

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 14:36, I wrote:

[…]
The only thing that searches have brought up


It was actually the best thing. Here’s an even more surprising hit:

   B. In the rules, allow these characters to bridge both 
alphabetic and numeric words, with:

 * Replace MidLetter by (MidLetter | MidNumLet)
 * Replace MidNum by (MidNum | MidNumLet)

   4. In addition, the following are also sometimes used, or could 
be used, as numeric separators (we don't give much guidance as to the best 
choice in the standard):

   0020 ( ) SPACE
   00A0 ( ) NO-BREAK SPACE
   2007 ( ) FIGURE SPACE
   2008 ( ) PUNCTUATION SPACE
   2009 ( ) THIN SPACE
   202F ( ) NARROW NO-BREAK SPACE

   If we had good reason to believe that if one of these only 
really occurred between digits in a single number, then we could add it. I 
don't have enough information to feel like a proposal for that is warranted, 
but others may. Short of that, we should at least document in the notes that 
some implementations may want to tailor MidNum to add some of these.


I fail to understand what kind of hack is going on. Why didn’t Unicode wish to sort out 
which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence 
and consistency, hence exit…
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.
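
To make this concrete, here is a minimal Python sketch (my own
illustration, not CLDR code) that formats a number the French way, with
U+202F as group separator and a decimal comma:

    # a minimal sketch: French-style number formatting with NNBSP groups
    def format_fr(value: float, decimals: int = 2) -> str:
        s = f"{value:,.{decimals}f}"  # e.g. '1,234,567.89'
        s = s.replace(",", "\u202f")  # U+202F as group separator
        return s.replace(".", ",")    # decimal comma

    print(format_fr(1234567.891, 3))  # -> '1 234 567,891' with NNBSP groups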

CLDR agreed to fix this for French in release 34. At the present survey for 
release 35, everything is questioned again, must be assessed, and may impact 
implementations, while all other locales using space are still impacted by bad 
display using NO-BREAK SPACE.

I know we have another public Mail List for that, but I feel it’s important to 
submit this to a larger community for consideration and eventually, for 
feedback.

Thanks.

Regards,

Marcel

P.S. For completeness:

http://unicode.org/L2/L2007/07370-punct.html

And also wrt my previous post:

https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt


Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encodinc in the UCS.


Then we should be able to read its encoding proposal in the UTC document 
registry, but Google Search seems unable to retrieve it, so there is a big risk 
that no such proposal exists, even though the registry goes back to 1990.

The only thing that searches have brought up to me is that the part of UAX #14 
that I’ve quoted in the parent thread has been added by a Unicode Technical 
Director not mentioned in the author field, and that he did it on request from 
two gentlemen whose first names only are cited. I’m sure their full names are 
Martin J. Dürst and Patrick Andries, but I may be wrong.

I apologize for the comment I’ve made in my e‑mail. Still it would be good to 
learn why the French use of NNBSP is sort of taken with a grain of salt, while 
all involved parties knew that this NNBSP was (as it still is) the only 
Unicode character ever encoded that is able to represent the so-long-asked-for 
“espace fine insécable.”

There is also another question I’ve been asking for a while: Why wasn’t the 
character U+2008 PUNCTUATION SPACE given the line break property value "GL" 
like its sibling U+2007 FIGURE SPACE?

This addition to UAX #14 is dated as early as “2007-08-08”. Why was the Core 
Specification not updated in sync, but only 7 years later? And was Unicode 
aware that this whitespace is hated by the industry to such an extent that a 
major vendor denied support in a major font at a major release of a major OS?

Or did they wait in vain that Martin and Patrick come knocking at their door to 
beg for font support?


Regards,

Marcel


The problem is that the initial rush for French was made in a period where 
Unicode and ISO were competing and not in sync, so no agreement could be found, 
until there was a decision to merge the efforts. The early rush was in ISO 
still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no 
desire to encode all the languages and scripts, focusing initially only on 
trying to unify the existing vendor character sets which were already 
implemented by a limited set of proprietary vendor implementations (notably 
IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
including the existing ISO 8859-*, GBK, and some national standards or de facto 
standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this time but still 
using another unrelated technology). Font standards did not yet exist and were 
competing in incompatible ways; all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

If publishers had been involved, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographers themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software was then completely proprietary (just like Microsoft's and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained their typists or ad sellers to use it (that character was not "sold" in classified ads; it was necessary for correct layout, notably in narrow columns, and not using it confused the readers, notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, 
and was consistently used as a decimal grouping separator.


But at that time the most common OSes did not support it natively because there was no 
vendor charset supporting it (and in fact most OSes were still unable to render 
proportional fonts everywhere and were frequently limited to 8-bit encodings: DOS, 
Windows, Unix(es), and even Linux at its early start). So an intermediate solution was 
needed. The US chose not to use the non-breakable thin space at all because in English it was 
not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for 
everything (but including its own national symbol for the "$", competing with 
other ISO 646 variants). There were tons of legacy applications developed over decades 
that did not support anything else, and interoperability in the US was available only with 
ASCII; everything else was unreliable.

If you remember the early years w

Re: NNBSP

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 09:58, Richard Wordingham wrote:


On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:


Also, at least one French typographer was extremely upset
about Unicode not gathering feedback from typographers.
That blame is partly wrong since at least one typographer
was and still is present in WG2, and even if not being a
Frenchman (but knowing French), as an Anglophone he might
have been aware of the most outstanding use case of NNBSP
with English (both British and American) quotation marks
when a nested quotation starts or ends a quotation, where
_‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
least with proportional fonts.


There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.


Thanks, I didn’t know that this is already implemented. Sometimes one can
read in discussions that the issue is dismissed to font level. That always
looked utopian to me, all the more as people bringing in former typewriting
expertise are trained to type spaces, and I always believed that it’s
a way for helpless keyboard layout designers to hand the job over.

Turns out there is more to it. But the high-end solution notwithstanding,
the use of an extra space character is recommended practice:

https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_
recommends a thin space, whereas _The Gregg Reference Manual_ promotes a
full space between the quotation marks. _The Chicago Manual of Style_ says
no space is necessary but adds that a space or a thin space can be inserted
as ‘a typographical nicety.’ ” The author cites three other manuals in which
she could not retrieve anything on the topic.

We note that all three style guides seem completely unconcerned with
non-breakability. Not so the author of the blog post: “[…] If your software
moves the double quotation mark to the next line of type, use a nonbreaking
space between the two marks to keep them together.” Certainly she would
recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard
or if the software provided a handy shortcut by default.
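
For the record, inserting such a space programmatically is trivial; a
minimal Python sketch (my own illustration; the choice of U+202F is the
assumption discussed above):

    import re

    # insert U+202F between abutting single/double quotation marks,
    # e.g. ...'" becomes ...'<NNBSP>"
    def space_nested_quotes(text: str) -> str:
        return re.sub("([\u2018\u2019\u201c\u201d])(?=[\u2018\u2019\u201c\u201d])",
                      "\\1\u202f", text)

    print(space_nested_quotes("He said, \u201cShe replied, \u2018No.\u2019\u201d"))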



This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)


Another drawback is that most environments don’t provide OpenType support,
and that the whole scheme depends on language tags that could easily get
lost, and that the issue, being particular to French, would quickly boil
down to dismissing support as not cost-effective, arguing that *if* some
individual locale has special requirements for punctuation layout, its
writers are welcome to pick an appropriate space from the UCS and key it
in as desired.

The same is also observed about Mongolian. Today, the preferred approach
for appending suffixes is to encode a Mongolian Suffix Connector to make
sure the renderer will use correct shaping, and to leave the space to the
writer’s discretion. That indeed looks much better than imposing a hard
space that proved cumbersome in practice, and that is
reported to often get in the way of a usable text layout.

The problems related to NNBSP as encountered in Mongolian are completely
absent when NNBSP is used with French punctuation or as the regular
group separator in numbers. Hence I’m sure that everybody on this List
agrees in discouraging changes made to the character properties of NNBSP,
such as switching the line breaking class (as "GL" is non-tailorable), or
changing general category to Cf, which could be detrimental to French.

However we need to admit that NNBSP is basically not a Latin but a
Mongolian space, despite being readily attracted into Western typography.
A similar disturbance takes place in word processors, where except in
Microsoft Word 2013, the NBSP is not justifying as intended and as it is
on the web. It’s being hacked and hijacked despite being a bad compromise,
for the purpose of French punctuation spacing. That tailoring is in turn
very detrimental to Polish users, among others, who need a justifying
no-break space for the purpose of prepending one-letter prepositions.

Fortunately a Polish user found and shared a workaround using the string
, the latter being still used in lieu of WORD JOINER as
long as Word keeps unsupporting latest TUS (an issue that raised concern
at Microsoft when it was reported, and will probably be fixed or has
already been fixed meanwhile).



Another spacing m

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Philippe Verdy via Unicode
On Thu, Jan 17, 2019 at 05:01, Marcel Schneider via Unicode
<unicode@unicode.org> wrote:

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode  wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode about this
> >> problem, blame the renderer (or the application or OS using it, may
> >> be they are very outdated and were not aware of these features, they
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present even if it was requested since very long at least for
> >> French and even English, before even Unicode, and long before
> >> Mongolian was then encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are completely script-neutral, they are generic, and are even
> >> BiDi-neutral, they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" was requested long long before
Mongolian was discussed for encodinc in the UCS. The problem is that the
initial rush for French was made in a period where Unicode and ISO were
competing and not in sync, so no agreement could be found, until there was
a decision to merge the efforts. Tge early rush was in ISO still not using
any character model but a glyph model, with little desire to support
multiple whitespaces; on the Unicode side, there was initially no desire to
encode all the languages and scripts, focusing initially only on trying to
unify the existing vendor character sets which were already implemented by
a limited set of proprietary vendor implementations (notably IBM,
Microsoft, HP, Digital) plus a few of the registered chrsets in IANA
including the existing ISO 8859-*, GBK, and some national standard or de
facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this
time but still using another unrelated technology). Font standards did not
yet exist and were competing in incompatible ways; all was a mess
at that time, so publishers were still required to use proprietary software
solutions, with very low interoperability (at that time the only "standard"
was PostScript, not needing any character encoding at all, but only
encoding glyphs!)

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Typographers
themselves did not care about whitespaces because they had no value for
them (no glyph to sell). Adobe's publishing software was then completely
proprietary (just like Microsoft's and others like Lotus, WordPerfect...).
Years ago I was working for the French press, and they absolutely required
us to manage the [FINE] for use in newspapers, classified ads, articles,
guides, phone books, dictionaries. It was even mandatory to enter these
[FINE] in the composed text, and they trained their typists or ad sellers
to use it (that character was not "sold" in classified ads; it was
necessary for correct layout, notably in narrow columns, and not using it
confused the readers, notably for the ":" colon): it had to be
non-breaking, non-expanding by justification, narrower than digits and even
narrower than standard non-justified whitespace, and was consistently used
as a decimal grouping separator.

But at that time the most common OSes did not support it natively because
there was no vendor charset supporting it (and in fact most OSes were still
unable to render proportional fonts everywhere and were frequently limited
to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breakable thin space at all because in English it was not needed for
basic Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (but including its own national symbol for the "$", competing
with other ISO 646 variants). There were tons of legacy applications
developed over decades that did not support anything else, and
interoperability in the US was available only with ASCII; everything else
was unreliable.

If you remember the early years when the Internet started to develop
outside US, you remember the nig

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

Courier New lacked NNBSP on Windows 7. It includes it on
Windows 10. The tests I referred to were made 2 years ago. I
confess that I was so disappointed to see Courier New unsupporting
NNBSP a decade after encoding, while many relevant people in the
industry were surely aware of its role and importance for French
(at least those keeping a branch office in France), that I gave it
up. Turns out that foundries delayed support until the usage
was backed by TUS, which happened in 2014, in time for Windows 10.
(I’m lacking hints about Windows 8 and 8.1.)

Superscripts are a handy parallel showcasing a similar process.
As long as preformatted superscripts are outlawed by TUS for use
in the digital representation of abbreviation indicators, vendors
keep disturbing their glyphs with what one could start calling an
intentional metrics disorder (IMD). One can also rank the vendors
on the basis of the intensity of IMD in preformatted superscripts,
but this is not the appropriate thread, and anyhow this List is
not the place. A comment on CLDR ticket #11653 is better.

[…]

Due to the way NNBSP made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New unsupporting NNBSP. […]

Marcel


Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:

> Also, at least one French typographer was extremely upset
> about Unicode not gathering feedback from typographers.
> That blame is partly wrong since at least one typographer
> was and still is present in WG2, and even if not being a
> Frenchman (but knowing French), as an Anglophone he might
> have been aware of the most outstanding use case of NNBSP
> with English (both British and American) quotation marks
> when a nested quotation starts or ends a quotation, where
> _‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
> unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
> least with proportional fonts.

There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.
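
A rough sketch of the idea (not the exact example from the OpenType
documentation; the font file, glyph name and adjustment values are
placeholders), using fontTools to compile such a rule into a font:

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    FEA = """
    languagesystem latn dflt;
    languagesystem latn FRA;

    feature kern {
        script latn;
        language FRA exclude_dflt;
        # GPOS single adjustment: shift the colon right and widen its
        # advance so French text gets a built-in gap before it
        position colon <100 0 200 0>;
    } kern;
    """

    font = TTFont("SomeFont.ttf")  # placeholder input font
    addOpenTypeFeaturesFromString(font, FEA)
    font.save("SomeFont-frpunct.ttf")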

This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)

Another spacing mess occurs with the Thai repetition mark U+0E46 THAI
CHARACTER MAIYAMOK, which is supposed to be separated from the
duplicated word by a space.  I'm not sure whether this space should
expand for justification any more often than inter-letter spacing. Some
fonts have taken to including the preceding space in the character's
glyph, which messes up interoperability.  An explicit space looks ugly
when the font includes the space in the repetition mark, and the lack of
an explicit space looks illiterate when the font excludes the leading
space.

Richard.



Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Marcel Schneider via Unicode

On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:


On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:


If your fonts behave incorrectly on your system because it does not
map any glyph for NNBSP, don't blame the font or Unicode about this
problem, blame the renderer (or the application or OS using it, may
be they are very outdated and were not aware of these features, they
are probably based on old versions of Unicode when NNBSP was still
not present even if it was requested since very long at least for
French and even English, before even Unicode, and long before
Mongolian was then encoded, only in Unicode and not in any known
supported legacy charset: Mongolian was specified by borrowing the
same NNBSP already designed for Latin, because the Mongolian space
had no known specific behavior: the encoded whitespaces in Unicode
are completely script-neutral, they are generic, and are even
BiDi-neutral, they are all usable with any script).


The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.


Indeed it was proposed as MONGOLIAN SPACE at block start, which was
consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
more. When Unicode argued in favor of a unification with , this was
pointed out as impracticable, and the need of a specific Mongolian space for
the purpose of appending suffixes was underscored. Only in London in
September 1998 was it agreed that “The Mongolian Space is retained but
moved to the general punctuation block and renamed ‘Narrow No Break Space’ ”.

However, unlike for the Mongolian Combination Symbols sequencing a question
and exclamation mark both ways, a concrete rationale as to how useful the
character could be in other scripts doesn’t seem to have been put on the
table when the move to General Punctuation was decided.



Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:


As a side-note: The relevant text of TUS doesn’t predate version 11 (2018).



1) It has a shaping effect on following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.


I don’t believe that these additions to TUS are in any way able to fix
the many issues with NNBSP in Mongolian causing so much headache and
ending up in a unanimous desire to replace NNBSP with a *new*
MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters,
e.g. “ ᠲᠠᠶᠢᠭᠠᠨ ”

https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf



Or is property (3) appropriate for French?


No it isn’t. It only introduces new flaws for a character that,
despite being encoded for Mongolian with specific handling intended,
was readily ripped off for use in French, Philippe Verdy reported,
to the extent that it is actually an encoding error in Mongolian
that brought the long-missing narrow non-breakable thin space into
the UCS, in the block where it really belongs, and where it would
have been encoded from the beginning if there had been no desire to
keep it proprietary.

That is the hidden (almost occult) fact from which stances like “The
NNBSP can be used to represent the narrow space occurring around
punctuation characters in French typography, which is called an
‘espace fine insécable.’ ” (TUS) and “When NARROW NO-BREAK SPACE
occurs in French text, it should be interpreted as an ‘espace fine
insécable’.” (UAX #14) are stemming. The underlying meaning
as I understand it now is like: “The non-breakable thin space is
usually a vendor-specific layout control in DTP applications; it’s
also available via a TeX command. However, if you are interested
in an interoperable representation, here’s a Unicode character you
can use instead.”

Due to the way NNBSP made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New unsupporting NNBSP. I’ll need to use
a font manager to output a complete list wrt NNBSP support.
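
For a quick check without a font manager, here is a small Python sketch
using fontTools (my own illustration; the font directory is an
assumption) that lists which fonts map U+202F in their cmap:

    import glob
    from fontTools.ttLib import TTFont

    # report which fonts carry a glyph mapping for U+202F NNBSP
    for path in sorted(glob.glob("C:/Windows/Fonts/*.ttf")):
        font = TTFont(path, lazy=True)
        try:
            has_nnbsp = 0x202F in font.getBestCmap()
        except Exception:  # fonts without a usable Unicode cmap
            has_nnbsp = False
        print("OK  " if has_nnbsp else "MISS", path)
        font.close()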

I’m utterly worried about the fate of the non-breaking thin
space in Unicode, and I wonder why the French and Canadian
French people present at the outset – either on the Unicode side or
on the JTC1/SC2/WG2 side – didn’t get this character encoded in
the initial rush. Did they really sell themselves and their
locales to DTP lobbyists? Or were they tricked?

Also, at least one

NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Richard Wordingham via Unicode
On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:

> If your fonts behave incorrectly on your system because it does not
> map any glyph for NNBSP, don't blame the font or Unicode about this
> problem, blame the renderer (or the application or OS using it, may
> be they are very outdated and were not aware of these features, they
> are probably based on old versions of Unicode when NNBSP was still
> not present even if it was requested since very long at least for
> French and even English, before even Unicode, and long before
> Mongolian was then encoded, only in Unicode and not in any known
> supported legacy charset: Mongolian was specified by borrowing the
> same NNBSP already designed for Latin, because the Mongolian space
> had no known specific behavior: the encoded whitespaces in Unicode
> are completely script-neutral, they are generic, and are even
> BiDi-neutral, they are all usable with any script).

The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.

Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:

1) It has a shaping effect on following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.

Or is property (3) appropriate for French?
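
As a quick empirical check (my own sketch, assuming PyICU is
available), one can list the untailored line-break opportunities around
U+202F; with line-break class GL, none should appear on either side of
it:

    import icu

    # line-break opportunities in a French fragment containing U+202F
    bi = icu.BreakIterator.createLineInstance(icu.Locale.getFrench())
    text = "il dit\u202f: oui"
    bi.setText(text)
    print(list(bi))  # boundary offsets; none fall next to the NNBSP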

Richard.


Re: NNBSP and Word Boundaries

2015-10-04 Thread Richard Wordingham
On Fri, 2 Oct 2015 09:25:01 +0200
Mark Davis ☕️ <m...@macchiato.com> wrote:

> We add:
> 
> WB13c Mongolian_Letter × NNBSP
> WB13d NNBSP × Mongolian_Letter
> 
> *If* we want to also change behavior on the other side of the NNBSP,
> whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
> additional rules (with the appropriate values for ..., like Numeric)
> 
> WB13c Mongolian_Letter NNBSP × (...)
> WB13d (...) × NNBSP Mongolian_Letter

I'll assume the last two are meant to be WB13e and WB13f.

We can achieve the effects down to the first WB13d simply by changing
NNBSP from XX to MidNumLet.  This would also provide a proper "espace
fine" for French use within numbers
( https://www.druide.com/enquetes/pour-des-espaces-ins%C3%A9cables-impeccables
) to separate groups of 3 digits.  This needs *no* extra rules.

Now for combined numbers and letters, we might consider adding the two
rules:

WB12a Numeric MidNumLet × AHLetter
WB12b Numeric × MidNumLet AHLetter

I think we should go the whole hog, and instead have

WB12c (Numeric|AHLetter) MidNumLetQ × (Numeric|AHLetter)
WB12d (Numeric|AHLetter) × MidNumLetQ (Numeric|AHLetter)

Perhaps there are good reasons against them - I'm not aware of any.  (I
don't think it is wrong to treat "no.2" as a single word.)  These rules
would make the abbreviated names of a good many Thai forms (e.g. คร.๒, a
marriage certificate) into a single word.

WB12c and WB12d overlap with WB6, WB7, WB11 and WB12, which could be
slightly simplified. 
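
A toy approximation of WB12c/d (my own sketch; a real implementation
would tailor the UAX #29 properties rather than use a regex) treats
NNBSP, apostrophe and full stop as MidNumLetQ-style bridges:

    import re

    # a word is an alphanumeric run bridged by MidNumLetQ-like separators
    WORD = re.compile(r"\w+(?:[.'\u2019\u202f]\w+)*")

    def words(text: str) -> list[str]:
        return WORD.findall(text)

    print(words("no.2 and 1\u202f234 kg"))
    # -> ['no.2', 'and', '1\u202f234', 'kg'] (NNBSP stays inside the number)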

Richard.



NNBSP and Word Boundaries

2015-10-01 Thread Richard Wordingham
The background document for PRI #308 (Property Change for NNBSP),
http://www.unicode.org/review/pri308/pri308-background.html , says,

"The only other widely noted use for U+202F NNBSP is for representation
of the thin non-breaking space (espace fine insécable) regularly seen
next to certain punctuation marks in French style typography. However,
the word segmentation change for U+202F should have no impact in that
context, as ExtendNumLet is explicitly for preventing breaks between
letters, but does not prevent the identification of word boundaries
next to punctuation marks."

Unfortunately, this isn't quite true.  In the text fragment
" dit[NNBSP]: ", there would be internal word-boundaries before 'd' and
before and after ':', but the word isolated would be the four characters
"dit[NNBSP]".  One solution would be to replace NNBSP by U+2009 THIN
SPACE, for with untailored line-breaking there would be no line break
between it and the 't' or colon, but there would be a word break
between the 't' and the thin space.
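
This is easy to reproduce; here is a small sketch, assuming PyICU is
available (results will track whatever UAX #29 version the installed
ICU implements):

    import icu

    # word boundaries around U+202F in a French fragment
    bi = icu.BreakIterator.createWordInstance(icu.Locale.getFrench())
    text = "il dit\u202f: oui"
    bi.setText(text)
    print(list(bi))  # boundary offsets show whether U+202F binds to "dit"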

The problem is that characters with property ExtendNumLet can be the
first or last character of a word as well as a character strictly
within a word.  In this respect, the property differs from characters
with the property MidNumLet.  The problem with using that property
instead is that such characters, such as FULL STOP, may be flanked by
letters or numbers within a word, but not both.  The problem then
arises with the Mongolian analogue of '4th' etc. - it is written digit,
NNBSP, letters, and is a single word.

Richard.