Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long before Mongolian 
was discussed for encoding in the UCS. The problem is that the initial rush for French 
came in a period when Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. The early rush 
in ISO was still using a glyph model rather than a character model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, the focus being initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital), plus a few of 
the charsets registered with IANA, including the existing ISO 8859-*, GBK, and some national 
standards or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this time, but still 
using another, unrelated technology). Font standards did not yet exist and competing formats 
were incompatible; everything was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, which needed no character encoding at 
all, but only encoded glyphs!)


Thank you for this insight. It is a still untold part of the history of Unicode.

It seems that there was little incentive to involve typographers, because they 
had no computer science training, and because it was feared that they would try to 
enforce requirements that Unicode was neither able nor willing to meet, such 
as distinct code points for italics, bold, small caps…

Among the grievances, Unicode is blamed for confusing Greek psili and dasia 
with comma shapes, and for misinterpreting Latin letter forms, such as the u 
with descender taken for a turned h, and the double u mistaken for a turned m, 
errors that subsequently misled font designers into applying misplaced serifs. 
Things were done in haste and in a hurry, under the sword of Damocles of a hostile 
ISO meddling and threatening to unleash an unusable standard if Unicode wasn’t 
quicker.


If publishers had been involved, they would have revealed that they all needed 
various whitespaces for correct typography (i.e. layout). Typographers themselves 
did not care about whitespaces because these had no value for them (no glyph to 
sell).


Nevertheless the whole range of traditional space forms was admitted, even though 
they were going to be of limited usability. And they were given properties.
Or can’t the misdefinition of PUNCTUATION SPACE be traced back to that era?


Adobe's publishing software was then completely proprietary (just like Microsoft's and others' such as 
Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, and 
dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained 
their typists and ads sellers to use it (that character was not "sold" in classified ads; 
it was necessary for correct layout, notably in narrow columns, and not using it confused the readers, 
notably before the ":" colon). It had to be non-breaking, non-expanding under justification, 
narrower than digits and even narrower than the standard non-justified whitespace, and it was consistently 
used as a decimal grouping separator.


No doubt they were confident that once a UCS was set up, such an important 
character wouldn’t be skipped.
So confident that they never guessed that they had a key role to play in reviewing, 
in providing feedback, in lobbying.
Too bad that we’re still so few people today, corporate vetters included, 
even though many things are still going wrong.


But at that time the most common OSes did not support it natively because there 
was no vendor charset supporting it (and in fact most OSes were still unable to 
render proportional fonts everywhere and were frequently limited to 8-bit 
encodings: DOS, Windows, Unix(es), and even Linux at its early start).


Was there a lack of foresight?
It turns out that today, when those characters are needed, they aren’t ready. Not 
even the NNBSP.

Perhaps it’s the poetic ‘justice of time’ that since Unicode is on, the 
Vietnamese are the foremost, and the French the hindmost.
[I’m alluding to the early lobbying of Vietnam for a comprehensive set of 
precomposed letters, while French was not even granted the benefit 
of the NNBSP – which according to PRI #308 [1] is today the only known use of 
NNBSP outside Mongolian – and of a handful of ordinal indicators (possibly along with 
the rest of the alphabet, except q).]

[1] “The only other widely noted use for U+202F NNBSP is for representation of the 
thin non-breaking space (/espace fine 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 14:36, I wrote:

[…]
The only thing that searches have brought up


It was actually the best thing. Here’s an even more surprising hit:

   B. In the rules, allow these characters to bridge both 
alphabetic and numeric words, with:

 * Replace MidLetter by (MidLetter | MidNumLet)
 * Replace MidNum by (MidNum | MidNumLet)


   -

   4. In addition, the following are also sometimes used, or could 
be used, as numeric separators (we don't give much guidance as to the best 
choice in the standard):

   0020  SPACE
   00A0  NO-BREAK SPACE
   2007  FIGURE SPACE
   2008  PUNCTUATION SPACE
   2009  THIN SPACE
   202F  NARROW NO-BREAK SPACE

   If we had good reason to believe that if one of these only 
really occurred between digits in a single number, then we could add it. I 
don't have enough information to feel like a proposal for that is warranted, 
but others may. Short of that, we should at least document in the notes that 
some implementations may want to tailor MidNum to add some of these.
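The tailoring suggested in the quoted text can be sketched as a regular expression that treats these spaces as MidNum-style separators inside a numeric token (a sketch only, not a UAX #29 implementation; the character class is an assumption drawn directly from the list above):

```python
import re

# Treat NO-BREAK SPACE, FIGURE SPACE, PUNCTUATION SPACE, THIN SPACE and
# NARROW NO-BREAK SPACE as MidNum-like separators between digit runs.
MID_NUM = "[\u00a0\u2007\u2008\u2009\u202f]"
NUMBER = re.compile(rf"\d+(?:{MID_NUM}\d+)*")

print(bool(NUMBER.fullmatch("1\u202f234\u202f567")))  # True: one numeric token
print(bool(NUMBER.fullmatch("1 234")))                # False: ordinary SPACE breaks it
```

An ordinary U+0020 SPACE deliberately stays outside the class, since it is breakable and would merge unrelated numbers into one token.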


I fail to understand what is going on here. Why didn’t Unicode wish to sort out 
which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence 
and consistency, hence exit…
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.
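The conclusion of that elimination can be sketched in a few lines of Python (the function name is illustrative; CLDR 34 adopted U+202F as the French group separator):

```python
NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

def format_fr(n: int) -> str:
    """Group digits by thousands with NNBSP, per the French convention."""
    return f"{n:,}".replace(",", NNBSP)

# The separators are non-breaking and narrow, so the number never wraps
# mid-group and stays visually compact in narrow columns.
print(format_fr(1234567))
```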

CLDR agreed to fix this for French in release 34. At present, in survey 35, 
everything is questioned again, must be reassessed, and may impact implementations, while 
all other locales using a space are still impacted by bad display using NO-BREAK 
SPACE.

I know we have another public mail list for that, but I feel it’s important to 
submit this to a larger community for consideration and possibly for 
feedback.

Thanks.

Regards,

Marcel

P.S. For completeness:

http://unicode.org/L2/L2007/07370-punct.html

And also wrt my previous post:

https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt









Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long before Mongolian 
was discussed for encoding in the UCS.


Then we should be able to read its encoding proposal in the UTC document 
registry, but Google Search seems unable to retrieve it, so there is a big risk 
that no such proposal exists, even though the registry goes back to 1990.

The only thing that searches have brought up to me is that the part of UAX #14 
that I’ve quoted in the parent thread has been added by a Unicode Technical 
Director not mentioned in the author field, and that he did it on request from 
two gentlemen whose first names only are cited. I’m sure their full names are 
Martin J. Dürst and Patrick Andries, but I may be wrong.

I apologize for the comment I’ve made in my e‑mail. Still, it would be good to 
learn why the French use of NNBSP is somehow taken with a grain of salt, while 
all involved parties knew that this NNBSP was (as it still is) the only 
Unicode character ever encoded that is able to represent the long-asked-for “espace 
fine insécable.”

There is also another question I’ve been asking for a while: why wasn’t the character U+2008 
PUNCTUATION SPACE given the line break property value "GL" like its 
sibling U+2007 FIGURE SPACE?
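The asymmetry can be checked against the UCD data file itself. A minimal sketch parsing entries in the LineBreak.txt format (the sample lines reproduce the published UAX #14 values: GL for U+2007 and U+202F, but BA, break after, for U+2008 and U+2009):

```python
# Sample single-code-point entries in LineBreak.txt syntax (ranges omitted).
SAMPLE = """\
2007;GL # FIGURE SPACE
2008;BA # PUNCTUATION SPACE
2009;BA # THIN SPACE
202F;GL # NARROW NO-BREAK SPACE
"""

def parse_line_break(text: str) -> dict[int, str]:
    """Map code points to their Line_Break property value."""
    props = {}
    for line in text.splitlines():
        data = line.split("#")[0].strip()  # drop trailing comment
        if data:
            cp, value = data.split(";")
            props[int(cp, 16)] = value.strip()
    return props

props = parse_line_break(SAMPLE)
print(props[0x2008], props[0x2007])  # BA GL: PUNCTUATION SPACE breaks, FIGURE SPACE does not
```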

This addition to UAX #14 is dated as early as “2007-08-08”. Why was the Core 
Specification not updated in sync, but only 7 years later? And was Unicode 
aware that this whitespace is hated by the industry to such an extent that a 
major vendor denied it support in a major font at a major release of a major OS?

Or did they wait in vain for Martin and Patrick to come knocking at their door to 
beg for font support?


Regards,

Marcel



Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Philippe Verdy via Unicode
Le jeu. 17 janv. 2019 à 05:01, Marcel Schneider via Unicode <
unicode@unicode.org> a écrit :

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode  wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode for this
> >> problem; blame the renderer (or the application or OS using it: maybe
> >> they are very outdated and were not aware of these features, they
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present, even though it had been requested for a very long time, at
> >> least for French and even English, before Unicode, and long before
> >> Mongolian was encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are completely script-neutral, they are generic, and are even
> >> BiDi-neutral; they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" was requested long before
Mongolian was discussed for encoding in the UCS. The problem is that the
initial rush for French came in a period when Unicode and ISO were
competing and not in sync, so no agreement could be found, until there was
a decision to merge the efforts. The early rush in ISO was still using
a glyph model rather than a character model, with little desire to support
multiple whitespaces; on the Unicode side, there was initially no desire to
encode all the languages and scripts, the focus being initially only on trying to
unify the existing vendor character sets which were already implemented by
a limited set of proprietary vendor implementations (notably IBM,
Microsoft, HP, Digital), plus a few of the charsets registered with IANA,
including the existing ISO 8859-*, GBK, and some national standards or de
facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this
time, but still using another, unrelated technology). Font standards did not
yet exist and competing formats were incompatible; everything was a mess
at that time, so publishers were still required to use proprietary software
solutions, with very low interoperability (at that time the only "standard"
was PostScript, which needed no character encoding at all, but only
encoded glyphs!)

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Typographers
themselves did not care about whitespaces because these had no value for
them (no glyph to sell). Adobe's publishing software was then completely
proprietary (just like Microsoft's and others' such as Lotus, WordPerfect...).
Years ago I was working for the French press, and they absolutely required
us to manage the [FINE] for use in newspapers, classified ads, articles,
guides, phone books, and dictionaries. It was even mandatory to enter these
[FINE] in the composed text, and they trained their typists and ads sellers
to use it (that character was not "sold" in classified ads; it was
necessary for correct layout, notably in narrow columns, and not using it
confused the readers, notably before the ":" colon). It had to be
non-breaking, non-expanding under justification, narrower than digits and even
narrower than the standard non-justified whitespace, and it was consistently used
as a decimal grouping separator.

But at that time the most common OSes did not support it natively because
there was no vendor charset supporting it (and in fact most OSes were still
unable to render proportional fonts everywhere and were frequently limited
to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breaking thin space at all, because in English it was not needed for basic
Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (but including its own national symbol for the "$", competing
with other ISO 646 variants). There were tons of legacy applications
developed over decades that did not support anything else, and
interoperability in the US was available only with ASCII; everything else was
unreliable.

If you remember the early years when the Internet started to develop
outside US, you remember the nightmare of non-interoperable 8-bit charsets
and the famous "mojibake" we saw everywhere. Then the competition between
ISO and Unicode lasted too long. But it was considered "too late" for
French to change anything (and Windows used 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

Courier New was lacking NNBSP on Windows 7. It includes it on
Windows 10. The tests I referred to were made 2 years ago. I
confess that I was so disappointed to see Courier New not supporting
NNBSP a decade after its encoding, while many relevant people in the
industry were surely aware of its role and importance for French
(at least those keeping a branch office in France), that I gave it
up. It turns out that foundries are delaying support until the usage
is backed by TUS, which happened in 2014, in time for Windows 10.
(I’m lacking hints about Windows 8 and 8.1.)

Superscripts are a handy parallel showcasing a similar process.
As long as preformatted superscripts are outlawed by TUS for use
in the digital representation of abbreviation indicators, vendors
keep disturbing their glyphs with what one could start calling an
intentional metrics disorder (IMD). One can also rank the vendors
on the basis of the intensity of IMD in preformatted superscripts,
but this is not the appropriate thread, and anyhow this List is
not the place. A comment on CLDR ticket #11653 is better.

[…]

Due to the way  made its delayed way into Unicode, font
support was reported, as late as almost exactly two years ago, to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New not supporting NNBSP. […]

Marcel


Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Marcel Schneider via Unicode

On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:


On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:


If your fonts behave incorrectly on your system because it does not
map any glyph for NNBSP, don't blame the font or Unicode for this
problem; blame the renderer (or the application or OS using it: maybe
they are very outdated and were not aware of these features, they
are probably based on old versions of Unicode when NNBSP was still
not present, even though it had been requested for a very long time, at least for
French and even English, before Unicode, and long before
Mongolian was encoded, only in Unicode and not in any known
supported legacy charset: Mongolian was specified by borrowing the
same NNBSP already designed for Latin, because the Mongolian space
had no known specific behavior: the encoded whitespaces in Unicode
are completely script-neutral, they are generic, and are even
BiDi-neutral; they are all usable with any script).


The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.


Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
consistent with the need for a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
more. When Unicode argued in favor of a unification with , this was
pointed out as impracticable, and the need for a specific Mongolian space for
the purpose of appending suffixes was underscored. Only in London in
September 1998 was it agreed that “The Mongolian Space is retained but
moved to the general punctuation block and renamed ‘Narrow No Break Space’ ”.

However, unlike for the Mongolian Combination Symbols sequencing a question
mark and an exclamation mark both ways, a concrete rationale as to how useful the
 could be in other scripts doesn’t seem to have been put on the table when
the move to General Punctuation was decided.



Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:


As a side-note: The relevant text of TUS doesn’t predate version 11 (2018).



1) It has a shaping effect on the following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.


I don’t believe that these additions to TUS are in any way able to fix
the many issues with  in Mongolian, causing so much headache and
ending up in a unanimous desire to replace  with a *new*
MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters,
e.g. “ ᠲᠠᠶᠢᠭᠠᠨ ”

https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf



Or is property (3) appropriate for French?


No it isn’t. It only introduces new flaws for a character that,
despite being encoded for Mongolian with specific handling intended,
was readily ripped off for use in French, as Philippe Verdy reported,
to such an extent that it is actually an encoding error in Mongolian
that brought the long-missing narrow non-breakable thin space into
the UCS, into the block where it really belongs, and where it would
have been encoded from the beginning had there been no desire to keep
it proprietary.

That is the hidden (almost occult) fact from which stances like “The
NNBSP can be used to represent the narrow space occurring around
punctuation characters in French typography, which is called an
‘espace fine insécable.’ ” (TUS) and “When NARROW NO-BREAK SPACE
occurs in French text, it should be interpreted as an ‘espace fine
insécable’.” (UAX #14) stem. The underlying meaning
as I understand it now is like: “The non-breakable thin space is
usually a vendor-specific layout control in DTP applications; it’s
also available via a TeX command. However, if you are interested
in an interoperable representation, here’s a Unicode character you
can use instead.”
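That interoperable use can be sketched in a few lines; the function below swaps the ordinary space before French high punctuation for NNBSP (a naive sketch, the function name is illustrative, and real text would need exceptions for URLs, times, and the like):

```python
import re

NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE, the "espace fine insécable"

def french_fine_spacing(text: str) -> str:
    """Replace the space before the double punctuation marks ; : ! ? with NNBSP."""
    return re.sub(r" ([;:!?])", NNBSP + r"\1", text)

print(french_fine_spacing("Attention : un exemple !"))
```

Only an existing U+0020 is replaced, so text already typed without the space is left untouched rather than being second-guessed.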

Due to the way  made its delayed way into Unicode, font
support was reported, as late as almost exactly two years ago, to
be extremely scarce, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New not supporting NNBSP. I’ll need to use
a font manager to output a complete list wrt NNBSP support.

I’m utterly worried about the fate of the non-breaking thin
space in Unicode, and I wonder why the French and Canadian
French people present at the outset – either on the Unicode side or
on the JTC1/SC2/WG2 side – didn’t get this character encoded in
the initial rush. Did they really sell themselves and their
locales to DTP lobbyists? Or were they tricked out?

Also, at least one 

NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Richard Wordingham via Unicode
On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:

> If your fonts behave incorrectly on your system because it does not
> map any glyph for NNBSP, don't blame the font or Unicode for this
> problem; blame the renderer (or the application or OS using it: maybe
> they are very outdated and were not aware of these features, they
> are probably based on old versions of Unicode when NNBSP was still
> not present, even though it had been requested for a very long time, at
> least for French and even English, before Unicode, and long before
> Mongolian was encoded, only in Unicode and not in any known
> supported legacy charset: Mongolian was specified by borrowing the
> same NNBSP already designed for Latin, because the Mongolian space
> had no known specific behavior: the encoded whitespaces in Unicode
> are completely script-neutral, they are generic, and are even
> BiDi-neutral; they are all usable with any script).

The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.

Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:

1) It has a shaping effect on the following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.

Or is property (3) appropriate for French?

Richard.