On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:
On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:
On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:
Marcel,
about your many detailed *technical* questions about the history of character
properties, I am afraid I have no specific recollection.
Other List Members are welcome to join in, many of whom are aware of how things
happened. My questions are meant to be rather simple. Summing up the main
ones:
1. Why does UTC ignore the need of a non-breakable thin space?
2. Why did UTC not declare PUNCTUATION SPACE non-breakable?
A less important piece of information would be how extensively typewriters with
proportional advance width were used to write books ready for print.
Another question you do answer below:
French is not the only language that uses a space to group figures. In fact, I
grew up with thousands separators being spaces, but in much of the existing
publications or documents there was certainly a full (ordinary) space being
used. Not surprisingly, because in those years documents were typewritten and
even many books were simply reproduced from typescript.
When it comes to figures, there are two different types of spaces.
One is a space that has the same width as a digit and is used in the layout of lists. For example, if
you have a leading currency symbol, you may want to have that lined up on the left and leave the
digits representing the amounts "ragged". You would fill the intervening spaces with this
"lining" space character and everything lines up.
That is exactly how I understood hot-metal typesetting of tables. What
surprises me is why computerized layout does work the same way instead of using
tabulations and appropriate tab stops (left, right, centered, decimal [with all
decimal separators lining up vertically]).
==> At the time Unicode was first created (and definitely before that, during the time of
non-universal character sets) many applications existed that used a "typewriter
model" and worked by space fill rather than decimal-point tabulation.
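The space-fill model described above can be sketched as follows. This is only
an illustration in Python, assuming a font in which digits and U+2007 FIGURE
SPACE share one advance width; the name fill_amount and the sample amounts are
made up for the example.

```python
FIGURE_SPACE = "\u2007"  # digit-wide space, used here as the "lining" space

def fill_amount(symbol: str, amount: str, width: int) -> str:
    """Keep the currency symbol flush left and pad up to the digits.

    `width` is the number of digit positions reserved for the amount
    (an assumption of this sketch, not something taken from UAX #14).
    """
    pad = max(width - len(amount), 0)
    return symbol + FIGURE_SPACE * pad + amount

# Every row has the same number of digit-wide cells, so the symbols
# line up on the left and the last digits line up on the right.
rows = [fill_amount("$", a, 7) for a in ("5", "1200", "1234567")]
```

With a decimal tab stop the padding would be computed by the layout engine
instead of being stored in the text, which is the contrast drawn above.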
If you are talking about applications, as opposed to typesetting tables for
book printing, then I’d suggest that the fixed-width display of tables could be
done much like still today’s source code layout, where normal space is used for
that purpose. In this use case, line wrap is typically turned off. That could
make non-breakable spaces sort of pointless (but I’m aware of your point
below), except if people are expected to re-use the data in other environments.
In that case, best practice is to use NNBSP as thousands separator while
displaying it like other monospace characters. That’s at least how today’s
monospace fonts work (provided they’re used in environments actually supporting
Unicode, which may not happen with applications running in terminal).
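As a minimal sketch of that practice (Python; the helper name group_digits is
hypothetical), the grouping can be produced with the built-in comma grouping
and then switched to U+202F NARROW NO-BREAK SPACE:

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

def group_digits(n: int, sep: str = NNBSP) -> str:
    """Group an integer in threes, using `sep` instead of the comma."""
    # The format spec "," inserts a comma every three digits;
    # the replacement swaps in the separator we actually want.
    return f"{n:,}".replace(",", sep)

sample = group_digits(1234567)  # "1", NNBSP, "234", NNBSP, "567"
```

How wide the separator is displayed remains a matter of the font; the stored
text is the same in monospace and proportional environments.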
From today's perspective that older model is inflexible and not the best
approach, but it is impossible to say how long this legacy approach hung on in
some places and how much data might exist that relied on certain long-standing
behaviors of these space characters.
My position for some time has been that legacy apps should use legacy libraries.
But I’ll come back on this when responding to Shawn Steele.
For a good solution, you always need to understand
(1) the requirement of your "index" case (French, in this case)
That’s okay.
(2) how it relates to similar requirements in (all!) other languages / scripts
That’s rather up to CLDR as I suggested, given it has the means to submit a
point to all vetters. See again below (in the part that you’ve cut off without
consideration).
(3) how it relates to actual legacy practice
That’s Shawn Steele’s point (see next reply).
(3a) what will suddenly no longer work if you change the properties on some
character
(3b) what older data will no longer work if the effective behavior of newer
applications changes
I’ll already note that this needs to be aware of actual use cases and/or to
delve into the OSes, that is far beyond what I can currently do, both wrt time
and wrt resources. The vetter’s role is to inform CLDR with correct data from
their locale. CLDR is then welcome to sort things out and to get in touch with
the industry, which CLDR TC is actually doing. But that has no impact on the
data submitted at survey time. Changing votes to tell “OK let the group
separator be NBSP as long as…” would be a lie.
In lists like that, you can get away with not using a narrow thousands
separator, because the overall context of the list indicates which digits
belong together and form a number. Having a narrow space may still look nicer,
but complicates the space fill between the symbol and the digits.
It does not, provided that all numbers have thousands separators, even if
filling with spaces. It looks nicer because it’s more legible.
Now for numbers in running text using an ordinary space has multiple drawbacks.
It's definitely less readable and, in digital representation, if you use 0020
you don't communicate that this is part of a single number that's best not
broken across lines.
Right.
The problem Unicode had is that it did not properly understand which of the two types of
"numeric" spaces was represented by "figure space". (I remember that we had
discussions on that during the early years, but that they were not really resolved and that we
moved on to other issues, of which many were demanding attention).
You were discussing whether the thousands separator should have the width of a
digit or the width of a period? Consistently with many other choices, the
solution would have been to encode them both as non-breakable, the more as both
were at hand, leaving the choice to the end-user.
==> Right, but remember, we started off encoding a set of spaces that existed
before Unicode (in some other character sets) and implicitly made the assumption
that those were the correct set (just like we took punctuation from ASCII and
similar sources and only added to it later, when we understood that they were
missing things --- generally always added, generally did not redefine behavior or
shape of existing code points).
Now I understand that what UAX #14 calls “the preferred space for use in
numbers” is actually preferred in the table layout you are referring to,
because it is easier to code when only the empty decimal separator position
uses PUNCTUATION SPACE, while grouping is performed with FIGURE SPACE.
That raises two questions, one of which has been often asked in this thread:
1. How is FIGURE SPACE supposed to be supported in legacy environments? (UAX
#14 mentions both its line breaking behavior and its width, but makes no
concessions for legacy apps…)
2. Why was PUNCTUATION SPACE not declared non-breakable? (If it had been, it
could have been re-purposed to space off French punctuation since the beginning
of Unicode, and French users would never have had a reason to be upset by the
lack of a narrow non-breaking space.)
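The difference behind question 2 can be seen in the character database. Here is
a small sketch using Python's standard unicodedata module. Note the assumption:
that module does not expose the UAX #14 Line_Break property, so the tag of the
compatibility decomposition is used as a proxy (<noBreak> versus <compat>).

```python
import unicodedata

# NO-BREAK SPACE, FIGURE SPACE, PUNCTUATION SPACE, NARROW NO-BREAK SPACE
SPACES = ["\u00a0", "\u2007", "\u2008", "\u202f"]

def decomposition_tag(ch: str) -> str:
    """Return the leading tag of the compatibility decomposition, or ''."""
    decomp = unicodedata.decomposition(ch)  # e.g. "<noBreak> 0020"
    return decomp.split()[0] if decomp else ""

# Map each character name to its decomposition tag.
tags = {unicodedata.name(ch): decomposition_tag(ch) for ch in SPACES}

for name, tag in tags.items():
    print(f"{name}: {tag}")
```

FIGURE SPACE carries the <noBreak> tag like NO-BREAK SPACE and NARROW NO-BREAK
SPACE, while PUNCTUATION SPACE carries <compat>.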
Current practice in electronic publishing was to use a non-breakable thin
space, Philippe Verdy reports. Did that information come in somehow?
==> probably not in the early days. Y
Perhaps it was ignored from the beginning on, like Philippe Verdy reports that
UTC ignored later demands, getting users upset.
That leaves us with the question of why it did so, given your statement that
it was not what I ended up suspecting.
Does "Y" stand for the peace symbol?
ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally
understood that the thousands separator should not have the width of a digit.
The alleged reason is security. Though on a typewriter, as you state, there is
scarcely any other option. By that time, all computerized text was fixed width,
Philippe Verdy reports. On-screen, I figure, not in book print.
==> much book printing was also done by photomechanically reproducing
typescript at that time. Not everybody wanted to pay typesetters and digital
typesetting wasn't as advanced. I actually did use a digital phototypesetter of
the period a few years before I joined Unicode, so I know. It was more powerful
than a typewriter, but not as powerful as TeX or later the Adobe products.
For one, you didn't typeset a page, only a column of text, and it required
manual paste-up etc.
Did you also see typewriters with proportional advance width (and
interchangeable type wheels)? That was the high end on the typewriter market.
(Already mentioned these typewriters in a previous e‑mail.) Books typeset this
way could use bold and (less easy) italic spans.
If you want to do the right thing you need:
(1) have a solution that works as intended for ALL languages using some form of
blank as a thousands separator - solving only the French issue is not enough.
We should not do this a language at a time.
That is how CLDR works.
CLDR data is by definition per-language. Except for inheritance, languages are
independent.
There are no "French" characters. When you encode characters, at best, some
code points may be script-specific. For punctuation and spaces not even that may be the
case. Therefore, as long as you try to solve this as if it *only* was a French problem,
you are not doing proper character encoding.
Again, I did not do that (and BTW CLDR is not doing “character encoding”).
Actually, to be able to post that blame you needed to cut off all the URLs I
provided you with. These links document that I did not “try to solve
this as if it only was a French problem[.]”
Here they are again, this time with copy-pasted snippets below.
I wrote: “But as soon as that was set up, I started lobbying for support of all
relevant locales at once:”
https://unicode.org/cldr/trac/ticket/11423
https://unicode.org/pipermail/cldr-users/2018-September/000842.html
* “To be cost-effective, locales using space as numbers group separator should
migrate at once from the wrong U+00A0 to the correct U+202F. I didn’t aim at
making French stand out, but at correcting an error in CLDR. Having even the
Canadian French sublocale stick with the wrong value makes no sense and is
mainly due to opaque inheritance relationships and to severe constraints on
vetters applying for fr-FR and subsequently reduced to look on helpless from
the sidelines when sublocales are not getting fixed.”
* “After having painstakingly caught up on support of some narrow fixed-width
no-break space (U+202F), the industry is now ready to migrate from U+00A0 to
U+202F. Doing it in a single rush is way more cost-effective than migrating one
locale this time, another locale next time, a handful of locales the time after,
possibly splitting them up in sublocales with different migration schedules. I
really believed that now Unicode proves ready to adopt the real group separator
in French, all relevant locales would be consistently pushed for correcting
that value in release 34. The v34 alpha overview makes clear they are not.
http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration
I aimed at correcting an error in CLDR, not at making French stand out.
Having many locales and sublocales stick with the wrong value makes no sense
any more.
https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d
The only effect is implementers skipping migration for fr-FR while waiting
for the others to catch up, then doing it for all at once.
There seems to be a misunderstanding: The *locale setting* is whether to use period, comma, space,
apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK
SPACE or NARROW NO-BREAK SPACE is *not a locale setting,* but it’s all about Unicode *design* and Unicode
*implementation.* I really thought that that was clear and that there’s no need to heavily insist on the ST
"French" forum. When referring to the "French thousands separator" I only meant that
unlike comma- or period-using locales, the French locale uses space and that the group separator space should
be the correct one. That did *not* mean that French should use *another* space than the other locales using
space.”
https://unicode.org/pipermail/cldr-users/2018-September/000843.html
and
https://unicode.org/cldr/trac/ticket/11423#comment:2
* “I've to confess that I did focus on French and only applied for fr-FR, but
there was a lot of work, see
http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth
waiting for very few vetters. Nevertheless I also cared for English (see
various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA
hadn’t caught up the group separator correction yet:
https://unicode.org/pipermail/cldr-users/2018-August/000825.html
Also I’m sorry for failing to provide appropriate feedback after beta
release and to post upstream messages urging to make sure all locales using
space for group separator be kept in synchrony.
I think the point about not splitting up all the data into locales is a very
good one.
There should be a common pool so that all locales using Arabic script have
automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually
fits all), and those locales using space should only need to specify "space" to
automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready
to give it currency in that role.”
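The common-pool idea quoted above could look roughly like this. It is a purely
hypothetical sketch in Python: the tables and names are invented for
illustration and are not CLDR's actual data format or API.

```python
# Hypothetical shared pool: one place decides which code point an
# abstract separator name stands for.
SEPARATOR_POOL = {
    "space": "\u202f",   # NARROW NO-BREAK SPACE
    "comma": ",",
    "period": ".",
    "arabic": "\u066c",  # ARABIC THOUSANDS SEPARATOR
}

# Hypothetical per-locale data: only the abstract choice is stored.
LOCALE_GROUP_SEPARATOR = {
    "fr": "space",
    "fr-CA": "space",
    "en": "comma",
}

def group_separator(locale: str) -> str:
    """Resolve a locale's group separator through the shared pool."""
    return SEPARATOR_POOL[LOCALE_GROUP_SEPARATOR[locale]]
```

Changing the single "space" entry would then move every space-using locale at
once, which is the point of migrating them together.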
Do these recommendations meet your requirements and sound okay to you?
Do you have colleagues in Germany and other countries that can confirm whether
their practice matches the French usage in all details, or whether there are
differences? (Including different acceptability of fallback renderings...).
No I don’t but people may wish to read German Wikipedia:
https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen
Shared in ticket #11423:
https://unicode.org/cldr/trac/ticket/11423#comment:15
==> for your proposal to be effective, you need to reach out.
Basically we vetters are just reporting the locale data. Beyond that, I’ve
already conceded a huge effort in reporting bugs in English data and in
communicating on lists and fora, including German (since the current survey
that has a very limited scope). I have limited time and resources.
Normally reaching out to all relevant locales is what CLDR can do best, by
posting guidelines, by e-mailing (on behalf of CLDR administrator and/or on the
public CLDR-users Mail List), and by prioritizing the items on the vetters’
dashboards.
If I can do something else, I’m ready, but people should not abuse that, since I
have many other tasks that I am no longer going to deprioritize. At some point
I’ll just start reporting to end-users that we’ve strived to get locale data in
synch, but that CLDR ended up rolling back our efforts, alleging other
priorities. If that is what you wish, I’d say that there’s no problem for me
except that I strongly dislike documenting an ugly mess.
(2) have a solution that works for lining figures as well as separators.
(3) have a solution that understands ALL uses of spaces that are narrower than normal
space. Once a character exists in Unicode, people will use it on the basis of
"closest fit" to make it do (approximately) what they want. Your proposal needs
to address any issues that would be caused by reinterpreting a character more narrowly
than it has been used. Only by comprehensively identifying ALL uses of comparable spaces
in various languages and scripts, you can hope to develop a solution that doesn't simply
break all non-French text in favor of supporting French typography.
There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar
of what PUNCTUATION SPACE should have been since the beginning.
==> I mentioned before that if something is universally "broken" it can
sometimes be resurrected, because even if you change its behavior retroactively, it will not
change something that ever worked correctly. (But you need to be sure that nobody repurposed
the NNBSP for something useful that is different from what you intend to use it for,
otherwise you can't change anything about it).
You may wish to look up Unicode’s own PRI#308 background page, where they
already hinted they’ve made sure it isn’t.
http://www.unicode.org/review/pri308/pri308-background.html
https://www.unicode.org/review/pri308/
https://www.unicode.org/review/pri308/feedback.html
If, however, you are merely adding a use for some existing character that does
not affect its properties, that is usually not as much of a problem - as long
as we can have some confidence that both usages will continue to be possible.
Actually, again, there is a problem with NNBSP in Mongolian.
Richard Wordingham reported at thread launch that Unicode have started tweaking
that space in a way that makes it unfit for French.
Now since you are aware that this operating mode is wrong, I’d suggest that you
reach back to them providing feedback about the inappropriateness of the latest
changes. Other people (including me) may do that as well, but I see better
chances for your recommendations to get implemented.

I say that because I recently strongly recommended in several pieces of
feedback that the math symbols should not be bidi-mirrored on a
tilde–reversed-tilde basis, because mirroring these compromises legibility of
the tilde symbol in low-end environments relying on
glyph-exchange-bidi-mirroring for best-fit display, but UTC took no action, and
off-list I was told that UTC is not interested. Nothing else than that, in
private mail. UTC are just not interested, without providing any technical
reasons.

Perhaps you better understand now why I posted what I suspected to be the
reason why UTC is not interested, or was not interested, in supporting a narrow
non-breaking space until Mongolian was encoded and needed the same for the
purpose of appending suffixes (as opposed to separating vowels, which is
performed by a similar space with another shaping behavior, and proper to
Mongolian). A hypothesis that you firmly dispelled afterwards, but without
answering my question about *why UTC was ignoring the demand for a narrow
non-breaking space, delaying support for French and heavily impacting French
implementations still today* due to less font support than if that space had
been in Unicode from version 1.1 on.
Perhaps you see why this issue has languished for so long: getting it right is
not a simple matter.
Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was
made non-breakable. Now we ended up with a mutated Mongolian Space that does
not work properly for Mongolian, but does for French and other Latin script
using languages. It would do so even more if TUS were blunter, urging all foundries to
update their whole catalogue soon.
==> You realize that I'm giving you general advice here, not something utterly
specific to NNBSP - I don't have the inputs and background to know whether your
approach is feasible or perhaps the best possible?
It is not “my approach”.
Other List Members may wish to help you answer my questions.
As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as
part of the added math support in Unicode 3.2). We need to be sure that the
assumptions about these that may have been made in math typesetting are not
invalidated.
That adds to the reasons why I’m asking why PUNCTUATION SPACE was not made
non-breakable when FIGURE SPACE was. The math usage has probably originated in
repurposing that space on the basis of its line breaking behavior. I don’t
suggest to make it non-breakable now. That deal was broken and will remain
broken. Now we must live with NNBSP and get more font support, while trying to
stop Unicode from making a mess of it that neither helps Mongolian nor French
nor all (other) locales grouping digits with a narrow space.
Not sure offhand whether UTR#25 captures all of that, but if you ever feel like
proposing a property change you MUST research that first (with the current
maintainers of that UTR or other experts).
I have NOT proposed any property change, and PUNCTUATION SPACE or "2008" are
NOT found in UTR #25 (Unicode Support for Mathematics).
This is the way Unicode is different from CLDR.
Marcel