On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:
On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:
Covering existing character sets (National, International and Industry) was _an_ (not
"the") important goal at the time: such coverage was understood as a necessary
(although not sufficient) condition that would enable data migration to Unicode as well
as enable Unicode-based systems to process and display non-Unicode data (by conversion).
I’d take this as a touchstone to infer that there were actual data files
including the standard typographic spaces encoded at U+2000..U+2006, and
electronic table layouts using them: “U+2007 figure space has a fixed width,
known as tabular width, which is the same width as digits used in tables.
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?
May I remind you that the beginnings of Unicode predate the development of the world wide
web. By 1993 the web had developed to the point where it was possible to easily access
material written in different scripts and languages, and by today it is certainly possible
to "sample" material to check for character usage.
When Unicode was first developed, it was best to work from the definition of
character sets and to assume that anything encoded in a given set was also used
somewhere. Several corporations had assembled supersets of the character sets that
their products were supporting. The most extensive was a collection from IBM.
(I'm blanking out on the name for this).
These collections, which often covered international standard character sets as
well, were some of the prime inputs into the early drafts of Unicode. With the
merger with ISO 10646, some characters from that effort that were not in the early
Unicode drafts were also added.
The code points from U+2000..U+2008 are part of that early collection.
Note that, prior to Unicode, no character set standard described in detail how
characters were to be used (with the possible exception of control functions).
Mostly, it was assumed that users knew what these characters were, and the
function of the character set was just to give a passive enumeration.
Unicode's character property model changed all that - but that meant that
properties for all of the characters had to be determined long after they were
first encoded in the original sources, and with only scant hints of what these
characters were intended to be. (Often, the only hint was a character name and
a rather poor bitmapped image.)
If you want to know the "legacy" behavior for these characters, it is more useful,
therefore, to see how they have been supported in existing software, and how they have been used in
documents since then. That gives you a baseline for understanding whether any change or
clarification of the properties of one of these code points will break "existing
practice".
Breaking existing practice should be a dealbreaker, no matter how
well-intentioned a change is. The only exception is where existing
implementations are de facto useless, because of glaring inconsistencies or
other issues. In such exceptional cases, deprecating some interpretations of
a character may be a net win.
However, if there's a consensus interpretation of a given character, then you can't just
go in and change it, even if that would make the character work "better" in a
given circumstance: you simply don't know (unless you research widely) how people have
used that character in documents that work for them. Breaking those documents
retroactively is not acceptable.
That is, however, what PRI #308 proposed: changing the Gc of NNBSP from Zs
to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the
*MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break,
for example, implementations that rely on Gc=Zs for the purpose of applying
a background color to all (otherwise invisible) space characters.
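To illustrate the kind of logic that such a Gc change would break, here is a minimal sketch in Python (the `highlight_spaces` helper is hypothetical, standing in for a renderer that colors space characters): it detects spaces solely through Gc=Zs, as reported by `unicodedata.category`.

```python
import unicodedata

def highlight_spaces(text):
    """Mark every character whose General_Category is Zs,
    standing in for applying a background color to it."""
    return "".join(
        f"[{ch}]" if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )

# U+202F NARROW NO-BREAK SPACE currently has Gc=Zs, so it is marked;
# were its Gc changed to Pc, this code would silently stop treating
# it as a space.
print(highlight_spaces("100\u202F000"))
```

The point is that such code has no reason to special-case NNBSP today; a category change would alter its behavior without any change to the code or the documents it processes.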
On the occasion of that Public Review Issue, J. S. Choi reported another use
case of NNBSP: between an integer and a vulgar fraction, pointing out an error in
TUS version 8.0 along the way: “the THIN SPACE does not prevent line breaking from
occurring, which is required in style guides such as the Chicago Manual of
Style”. In version 11.0 the erroneous part is still uncorrected: “If the
fraction is to be separated from a previous number, then a space can be used,
choosing the appropriate width (normal, thin, zero width, and so on). For
example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.” Note
that TUS has typeset this example with the precomposed U+00BE, not with plain
digits and a fraction slash.
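The TUS recipe quoted above can be written out explicitly. This is a sketch only; whether the separator should be U+2009 THIN SPACE (breakable) or U+202F NARROW NO-BREAK SPACE (non-breaking, as Choi's use case requires) is exactly the point at issue:

```python
# Composing a mixed number per the TUS example: digit + space + numerator
# + U+2044 FRACTION SLASH + denominator. Rendering engines with fraction
# support may display "3⁄4" as a stacked fraction.
THIN_SPACE = "\u2009"       # line break property BA: allows a break after it
NNBSP = "\u202F"            # line break property GL: glues the number to the fraction
FRACTION_SLASH = "\u2044"

breakable = "1" + THIN_SPACE + "3" + FRACTION_SLASH + "4"
glued = "1" + NNBSP + "3" + FRACTION_SLASH + "4"
print(breakable)
print(glued)
```

Only the second string keeps the integer and the fraction on the same line under UAX #14 line breaking, which is what the Chicago Manual of Style requires.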
If U+2008 PUNCTUATION SPACE is used as intended, changing its line break
property from A to GL breaks neither implementations nor documents. As for
possible misuse of the character in ways other than intended: there is
generally no point in using, as a breakable space, a character that is
actually just a thin variant of U+2007 FIGURE SPACE.
Hence the question, again: Why was PUNCTUATION SPACE not declared as
non-breakable?
Marcel
That sample also raises concern, as it showcases how much is done, or left
undone, to keep NNBSP out of use in the Latin script. To what avail?