Re: NNBSP

Marcel Schneider via Unicode Fri, 18 Jan 2019 23:37:29 -0800

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:

On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:

On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:


Marcel,

about your many detailed *technical* questions about the history of character 
properties, I am afraid I have no specific recollection.

Other List Members are welcome to join in, many of whom are aware of how things 
happened. My questions are meant to be rather simple. Summing up the main 
ones:

 1. Why does UTC ignore the need of a non-breakable thin space?
 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?

A less important piece of information would be how extensively typewriters with 
proportional advance width were used to write books ready for print.

Another question you do answer below:

French is not the only language that uses a space to group figures. In fact, I 
grew up with thousands separators being spaces, but in much of the existing 
publications or documents there was certainly a full (ordinary) space being 
used. Not surprisingly, because in those years documents were typewritten and 
even many books were simply reproduced from typescript.

When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if 
you have a leading currency symbol, you may want to have that lined up on the left and leave the 
digits representing the amounts "ragged". You would fill the intervening spaces with this 
"lining" space character and everything lines up.
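
As a rough illustration of the space-fill layout described above, here is a minimal Python sketch. It assumes that the "lining" space is U+2007 FIGURE SPACE and that the font gives it the advance width of a digit; the helper name `space_fill` and the sample amounts are made up for the example.

```python
FIGURE_SPACE = "\u2007"  # assumed to be as wide as a digit in the font used


def space_fill(symbol: str, amount: str, digits: int) -> str:
    """Put the currency symbol first, then pad the amount on the left
    with FIGURE SPACE so that the digits end in the same column."""
    return symbol + amount.rjust(digits, FIGURE_SPACE)


# Hypothetical amounts, each laid out in a field of 6 digit positions.
rows = [space_fill("$", amount, 6) for amount in ("5", "1250", "987654")]
for row in rows:
    print(row)
```

Every row has the same number of characters, so the symbols line up on the left only as long as the fill character really is digit-wide.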

That is exactly how I understood hot-metal typesetting of tables. What 
surprises me is why computerized layout works the same way instead of using 
tabulation and appropriate tab stops (left, right, centered, decimal [with all 
decimal separators lining up vertically]).


==> At the time Unicode was first created (and definitely before that, during the time of 
non-universal character sets) many applications existed that used a "typewriter 
model" and worked by space fill rather than decimal-point tabulation.

If you are talking about applications, as opposed to typesetting tables for 
book printing, then I’d suggest that the fixed-width display of tables could be 
done much like still today’s source code layout, where normal space is used for 
that purpose. In this use case, line wrap is typically turned off. That could 
make non-breakable spaces sort of pointless (but I’m aware of your point 
below), except if people are expected to re-use the data in other environments. 
In that case, best practice is to use NNBSP as thousands separator while 
displaying it like other monospace characters. That’s at least how today’s 
monospace fonts work (provided they’re used in environments actually supporting 
Unicode, which may not happen with applications running in terminal).


From today's perspective that older model is inflexible and not the best 
approach, but it is impossible to say how long this legacy approach hung on in 
some places and how much data might exist that relied on certain long-standing 
behaviors of these space characters.

My position for some time has been that legacy apps should use legacy libraries. 
But I’ll come back to this when responding to Shawn Steele.


For a good solution, you always need to understand

(1) the requirement of your "index" case (French, in this case)

That’s okay.


(2) how it relates to similar requirements in (all!) other languages / scripts

That’s rather up to CLDR as I suggested, given it has the means to submit a 
point to all vetters. See again below (in the part that you’ve cut off without 
consideration).


(3) how it relates to actual legacy practice

That’s Shawn Steele’s point (see next reply).


(3a) what will suddenly no longer work if you change the properties on some 
character

(3b) what older data will no longer work if the effective behavior of newer 
applications changes

I’ll note already that this requires knowing actual use cases and/or 
delving into the OSes, which is far beyond what I can currently do, both wrt time 
and wrt resources. The vetter’s role is to inform CLDR with correct data from 
their locale. CLDR is then welcome to sort things out and to get in touch with 
the industry, which CLDR TC is actually doing. But that has no impact on the 
data submitted at survey time. Changing votes to tell “OK let the group 
separator be NBSP as long as…” would be a lie.

In lists like that, you can get away with not using a narrow thousands 
separator, because the overall context of the list indicates which digits 
belong together and form a number. Having a narrow space may still look nicer, 
but complicates the space fill between the symbol and the digits.

It does not, provided that all numbers have thousands separators, even if 
filling with spaces. It looks nicer because it’s more legible.


Now for numbers in running text using an ordinary space has multiple drawbacks. 
It's definitely less readable and, in digital representation, if you use 0020 
you don't communicate that this is part of a single number that's best not 
broken across lines.
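
The line-breaking point can be sketched in Python. This is only an illustration under stated assumptions: the standard `textwrap` module breaks at ASCII whitespace only, so it stands in here for a line breaker that honours no-break spaces, and it is not an implementation of UAX #14. The helper names `group` and `wrap_with` and the sample sentence are invented for the example.

```python
import textwrap

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def group(n: int, sep: str) -> str:
    """Group the digits of n in threes, using sep as the separator."""
    return f"{n:,}".replace(",", sep)


def wrap_with(sep: str, width: int = 12) -> list[str]:
    """Wrap a short sentence whose number is grouped with sep."""
    return textwrap.wrap(f"Total: {group(1234567, sep)} euros", width=width)


broken = wrap_with(" ")   # U+0020: the wrapper may break inside the number
kept = wrap_with(NNBSP)   # U+202F: the number stays on one line

print(broken)
print(kept)
```

With U+0020 the wrapper has no way to know that the three digit groups belong together; with U+202F the number is a single unbreakable chunk.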

Right.


The problem Unicode had is that it did not properly understand which of the two types of 
"numeric" spaces was represented by "figure space". (I remember that we had 
discussions on that during the early years, but that they were not really resolved and that we 
moved on to other issues, of which many were demanding attention).

You were discussing whether the thousands separator should have the width of a 
digit or the width of a period? Consistent with many other choices, the 
solution would have been to encode them both as non-breakable, all the more so as both 
were at hand, leaving the choice to the end-user.


==> Right, but remember, we started off encoding a set of spaces that existed 
before Unicode (in some other character sets) and implicitly made the assumption 
that those were the correct set (just like we took punctuation from ASCII and 
similar sources and only added to it later, when we understood that they were 
missing things --- generally always added, generally did not redefine behavior or 
shape of existing code points).

Now I understand that what UAX #14 calls “the preferred space for use in 
numbers” is actually preferred in the table layout you are referring to, 
because it is easier to code when only the empty decimal separator position 
uses PUNCTUATION SPACE, while grouping is performed with FIGURE SPACE.

That raises two questions, one of which has been often asked in this thread:

1. How is FIGURE SPACE supposed to be supported in legacy environments? (UAX 
#14 mentions both its line breaking behavior and its width, but makes no 
concessions for legacy apps…)
2. Why was PUNCTUATION SPACE not declared non-breakable? (If it had been, it 
could have been re-purposed to space off French punctuation since the beginning 
of Unicode, and French users would never have had a reason to be upset by the lack of a 
narrow non-breaking space.)
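
The difference between these spaces can be inspected from Python's standard `unicodedata` module. The module does not expose the Line_Break property of UAX #14, so the sketch below uses the `<noBreak>` compatibility decomposition tag from UnicodeData.txt as a proxy; that proxy is an assumption of this example, not the normative definition of line-breaking behavior. The helper name `describe` is invented.

```python
import unicodedata

# SPACE, NO-BREAK SPACE, FIGURE SPACE, PUNCTUATION SPACE,
# THIN SPACE, NARROW NO-BREAK SPACE
SPACES = ["\u0020", "\u00a0", "\u2007", "\u2008", "\u2009", "\u202f"]


def describe(ch: str) -> tuple[str, str, bool]:
    """Return (code point, character name, has a <noBreak> decomposition)."""
    tag = unicodedata.decomposition(ch).split(" ")[0]
    return f"U+{ord(ch):04X}", unicodedata.name(ch), tag == "<noBreak>"


for ch in SPACES:
    print(describe(ch))
```

In this data FIGURE SPACE and NARROW NO-BREAK SPACE carry the `<noBreak>` tag, while PUNCTUATION SPACE and THIN SPACE carry `<compat>`, which is the asymmetry that question 2 is about.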


Current practice in electronic publishing was to use a non-breakable thin 
space, Philippe Verdy reports. Did that information come in somehow?


==> probably not in the early days. Y

Perhaps it was ignored from the beginning, just as Philippe Verdy reports that 
UTC ignored later demands, getting users upset.
That leaves us with the question of why it did so, given your statement that 
the reason was not what I ended up suspecting.

Does "Y" stand for the peace symbol?


ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally 
understood that the thousands separator should not have the width of a digit. 
The alleged reason is security. Though on a typewriter, as you state, there is 
scarcely any other option. By that time, all computerized text was fixed width, 
Philippe Verdy reports. On screen, I figure, not in book print.


==> much book printing was also done by photomechanically reproducing 
typescript at that time. Not everybody wanted to pay typesetters and digital 
typesetting wasn't as advanced. I actually did use a digital phototypesetter of 
the period a few years before I joined Unicode, so I know. It was more powerful 
than a typewriter, but not as powerful as TeX or later the Adobe products.

For one, you didn't typeset a page, only a column of text, and it required 
manual paste-up etc.

Did you also see typewriters with proportional advance width (and 
interchangeable type wheels)? That was the high end on the typewriter market. 
(Already mentioned these typewriters in a previous e‑mail.) Books typeset this 
way could use bold and (less easy) italic spans.

If you want to do the right thing you need:

(1) have a solution that works as intended for ALL languages using some form of 
blank as a thousands separator - solving only the French issue is not enough. 
We should not do this a language at a time.

That is how CLDR works.


CLDR data is by definition per-language. Except for inheritance, languages are 
independent.

There are no "French" characters. When you encode characters, at best, some 
code points may be script-specific. For punctuation and spaces not even that may be the 
case. Therefore, as long as you try to solve this as if it *only* was a French problem, 
you are not doing proper character encoding.

Again, I did not do that (and BTW CLDR is not doing “character encoding”). 
Actually, to be able to post that blame you needed to cut off all the URLs I 
provided you with. These links are documenting that I did not “try to solve 
this as if it only was a French problem[.]”

Here they are again, this time with copy-pasted snippets below.
I wrote: “But as soon as that was set up, I started lobbying for support of all 
relevant locales at once:”

https://unicode.org/cldr/trac/ticket/11423
https://unicode.org/pipermail/cldr-users/2018-September/000842.html

 * “To be cost-effective, locales using space as numbers group separator should 
migrate at once from the wrong U+00A0 to the correct U+202F. I didn’t aim at 
making French stand out, but at correcting an error in CLDR. Having even the 
Canadian French sublocale stick with the wrong value makes no sense and is 
mainly due to opaque inheritance relationships and to severe constraints on 
vetters applying for fr-FR and subsequently reduced to looking on helplessly from 
the sidelines when sublocales are not getting fixed.”

 * “After having painstakingly caught up on support for some narrow fixed-width 
no-break space (U+202F), the industry is now ready to migrate from U+00A0 to 
U+202F. Doing it in a single rush is way more cost-effective than migrating one 
locale this time, another locale next time, a handful of locales the time after, 
possibly splitting them up in sublocales with different migration schedules. I 
really believed that now Unicode proves ready to adopt the real group separator 
in French, all relevant locales would be consistently pushed for correcting 
that value in release 34. The v34 alpha overview makes clear they are not. 
   http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

   I aimed at correcting an error in CLDR, not at making French stand out. 
Having many locales and sublocales stick with the wrong value makes no sense 
any more.
   
https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

   The only effect is implementers skipping migration for fr-FR while waiting 
for the others to catch up, then doing it for all at once.

   There seems to be a misunderstanding: The *locale setting* is whether to use period, comma, space, 
apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK 
SPACE or NARROW NO-BREAK SPACE is *not a locale setting,* but it’s all about Unicode *design* and Unicode 
*implementation.* I really thought that that was clear and that there’s no need to heavily insist on the ST 
"French" forum. When referring to the "French thousands separator" I only meant that 
unlike comma- or period-using locales, the French locale uses space and that the group separator space should 
be the correct one. That did *not* mean that French should use *another* space than the other locales using 
space.”

https://unicode.org/pipermail/cldr-users/2018-September/000843.html
and
https://unicode.org/cldr/trac/ticket/11423#comment:2

 * “I've to confess that I did focus on French and only applied for fr-FR, but 
there was a lot of work, see 
   http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth
   waiting for very few vetters. Nevertheless I also cared for English (see 
various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA 
hadn’t caught up the group separator correction yet:
   https://unicode.org/pipermail/cldr-users/2018-August/000825.html

   Also I’m sorry for failing to provide appropriate feedback after beta 
release and to post upstream messages urging to make sure all locales using 
space for group separator be kept in synchrony.

   I think the point about not splitting up all the data into locales is a very 
good one.

   There should be a common pool so that all locales using Arabic script have 
automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually 
fits all), and those locales using space should only need to specify "space" to 
automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready 
to give it currency in that role.”
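
The "common pool" idea can be sketched as a two-level lookup. This is a hypothetical illustration only: it is not how CLDR stores its data, and the names `SEPARATOR_POOL`, `LOCALE_KIND` and `group_separator` are invented for the example. The point is that a locale records only the kind of separator, while the code point for "space" is decided in one place.

```python
# One shared pool: the kind of separator mapped to the actual character.
SEPARATOR_POOL = {
    "period": ".",
    "comma": ",",
    "space": "\u202f",   # NARROW NO-BREAK SPACE, chosen once for all locales
    "arabic": "\u066c",  # ARABIC THOUSANDS SEPARATOR
}

# Per-locale data only names the kind (illustrative subset, not CLDR data).
LOCALE_KIND = {
    "fr": "space",
    "fr-CA": "space",
    "en": "comma",
}


def group_separator(locale: str) -> str:
    """Resolve a locale's group separator through the shared pool."""
    return SEPARATOR_POOL[LOCALE_KIND[locale]]


print(repr(group_separator("fr")), repr(group_separator("en")))
```

Under this layout, migrating from U+00A0 to U+202F is a single change to the pool, so sublocales cannot fall out of sync.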


Do these recommendations meet your requirements and sound okay to you?


Do you have colleagues in Germany and other countries that can confirm whether 
their practice matches the French usage in all details, or whether there are 
differences? (Including different acceptability of fallback renderings...).

No, I don’t, but people may wish to read German Wikipedia:

https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen

Shared in ticket #11423:
https://unicode.org/cldr/trac/ticket/11423#comment:15



==> for your proposal to be effective, you need to reach out.

Basically we vetters are just reporting the locale data. Beyond that, I’ve 
already made a huge effort in reporting bugs in English data and in 
communicating on lists and fora, including German (since the current survey, 
which has a very limited scope). I have limited time and resources.

Normally reaching out to all relevant locales is what CLDR can do best, by 
posting guidelines, by e-mailing (on behalf of CLDR administrator and/or on the 
public CLDR-users Mail List), and by prioritizing the items on the vetters’ 
dashboards.

If I can do something else, I’m ready, but people should not abuse this, since I have 
many other tasks that I am not going to deprioritize any longer. At some point 
I’ll just start reporting to end-users that we’ve strived to get locale data in 
synch, but that CLDR ended up rolling back our efforts, alleging other 
priorities. If that is what you wish, I’d say that there’s no problem for me 
except that I strongly dislike documenting an ugly mess.

(2) have a solution that works for lining figures as well as separators.

(3) have a solution that understands ALL uses of spaces that are narrower than normal 
space. Once a character exists in Unicode, people will use it on the basis of 
"closest fit" to make it do (approximately) what they want. Your proposal needs 
to address any issues that would be caused by reinterpreting a character more narrowly 
than it has been used. Only by comprehensively identifying ALL uses of comparable spaces 
in various languages and scripts, you can hope to develop a solution that doesn't simply 
break all non-French text in favor of supporting French typography.

There is no such problem except that NNBSP has never worked properly in 
Mongolian. It was an encoding error, and that is the reason why, to date, all 
font developers unanimously request the Mongolian Suffix Connector. That leaves 
NNBSP for what it is consistently used as outside Mongolian: a non-breakable 
thin space, a kind of belated avatar of what PUNCTUATION SPACE should have been 
from the beginning.


==> I mentioned before that if something is universally "broken" it can 
sometimes be resurrected, because even if you change its behavior retroactively, it will not 
change something that ever worked correctly. (But you need to be sure that nobody repurposed 
the NNBSP for something useful that is different from what you intend to use it for, 
otherwise you can't change anything about it).

You may wish to look up Unicode’s own PRI#308 background page, where they 
already hinted they’ve made sure it isn’t.
http://www.unicode.org/review/pri308/pri308-background.html
https://www.unicode.org/review/pri308/
https://www.unicode.org/review/pri308/feedback.html

If, however, you are merely adding a use for some existing character that does 
not affect its properties, that is usually not as much of a problem - as long 
as we can have some confidence that both usages will continue to be possible.

Actually, again, there is a problem with NNBSP in Mongolian.

Richard Wordingham reported at thread launch that Unicode have started tweaking 
that space in a way that makes it unfit for French.

Now since you are aware that this operating mode is wrong, I’d suggest that you 
reach back to them, providing feedback about the inappropriateness of the latest 
changes. Other people (including me) may do that as well, but I see better 
chances for your recommendations to get implemented. I say that because I 
recently strongly recommended in several pieces of feedback that the math 
symbols should not be bidi-mirrored on a tilde–reversed-tilde basis, because 
mirroring these compromises legibility of the tilde symbol in low-end 
environments relying on glyph-exchange-bidi-mirroring for best-fit display, but 
UTC took no action, and off-list I was told that UTC is not interested. Nothing 
else than that, in private mail. UTC are just not interested, without providing 
any technical reasons.

Perhaps you better understand now why I posted what I suspected to be the 
reason why UTC is not interested, or was not interested, in supporting a narrow 
non-breaking space until Mongolian was encoded and needed the same for the 
purpose of appending suffixes (as opposed to separating vowels, which is 
performed by a similar space with another shaping behavior, and proper to 
Mongolian). A hypothesis that you firmly dispelled afterwards, but without 
answering my question about */why UTC was ignoring the demand for a narrow 
non-breaking space, delaying support for French and heavily impacting French 
implementations still today/* due to less font support than if that space had 
been in Unicode from version 1.1 on.

Perhaps you see why this issue has languished for so long: getting it right is 
not a simple matter.

Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was 
made non-breakable. Now we ended up with a mutated Mongolian Space that does 
not work properly for Mongolian, but does for French and other Latin-script 
languages. It would work even better if TUS were blunter, urging all foundries to 
update their whole catalogue soon.


==> You realize that I'm giving you general advice here, not something utterly 
specific to NNBSP - I don't have the inputs and background to know whether your 
approach is feasible or perhaps the best possible?

It is not “my approach”.

Other List Members may wish to help you answer my questions.


As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as 
part of the added math support in Unicode 3.2). We need to be sure that the 
assumptions about these that may have been made in math typesetting are not 
invalidated.

That adds to the reasons why I’m asking why PUNCTUATION SPACE was not made 
non-breakable when FIGURE SPACE was. The math usage has probably originated in 
repurposing that space on the basis of its line breaking behavior. I don’t 
suggest making it non-breakable now. That deal was broken and will remain 
broken. Now we must live with NNBSP and get more font support, while trying to 
stop Unicode from making a mess of it that neither helps Mongolian nor French 
nor all (other) locales grouping digits with a narrow space.


Not sure offhand whether UTR#25 captures all of that, but if you ever feel like 
proposing a property change you MUST research that first (with the current 
maintainers of that UTR or other experts).

I have NOT proposed any property change, and PUNCTUATION SPACE or "2008" are 
NOT found in UTR #25 (Unicode Support for Mathematics).


This is the way Unicode is different from CLDR.

Marcel
