On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:
On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).

I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?

May I remind you that the beginnings of Unicode predate the development of the world wide 
web. By 1993 the web had developed to where it was possible to easily access material 
written in different scripts and languages, and by today it is certainly possible to 
"sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of 
character sets and to assume that anything encoded in a given set was also used 
somewhere. Several corporations had assembled supersets of character sets that 
their products were supporting. The most extensive was a collection from IBM. 
(I'm blanking out on the name for this).

These collections, which often covered international standard character sets as 
well, were some of the prime inputs into the early drafts of Unicode. With the 
merger with ISO 10646, some characters that were part of that effort but not in 
the early Unicode drafts were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note that prior to Unicode, no character set standard described in detail how 
characters were to be used (with the exception, perhaps, of control functions). 
Mostly, it was assumed that users knew what these characters were and the 
function of the character set was just to give a passive enumeration.

Unicode's character property model changed all that - but that meant that 
properties for all of the characters had to be determined long after they were 
first encoded in the original sources, and with only scant hints of the 
identity of what these were intended to be. (Often, the only hint was a 
character name and a rather poor bitmapped image).

If you want to know the "legacy" behavior for these characters, it is more useful, 
therefore, to see how they have been supported in existing software, and how they have been used in 
documents since then. That gives you a baseline for understanding whether any change or 
clarification of the properties of one of these code points will break "existing 
practice".

Breaking existing practice should be a dealbreaker, no matter how 
well-intentioned a change is. The only exception is where existing 
implementations are de-facto useless, because of glaring inconsistencies or 
other issues. In such exceptional cases, deprecating some interpretations of 
a character may be a net win.

However, if there's a consensus interpretation of a given character, then you can't just go 
in and change it, even if it would make that character work "better" for a 
given circumstance: you simply don't know (unless you research widely) how people have 
used that character in documents that work for them. Breaking those documents 
retroactively is not acceptable.

That, however, is what PRI #308 proposed: changing the Gc of NNBSP from Zs 
to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the 
*MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break, 
for example, those implementations that rely on Gc=Zs to apply 
a background color to all (otherwise invisible) space characters.
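
To make that dependency concrete, here is a minimal sketch in Python of the kind 
of logic such an implementation might use. It relies only on the standard 
unicodedata module; the helper names and the visible marker are hypothetical, 
chosen for illustration, and are not taken from any particular product.

```python
import unicodedata

def is_space_separator(ch: str) -> bool:
    # True when the character's General_Category is Zs (space separator).
    return unicodedata.category(ch) == "Zs"

def mark_spaces(text: str, marker: str = "·") -> str:
    # Hypothetical stand-in for "apply a background color": replace every
    # Gc=Zs character with a visible marker so it can be seen in output.
    return "".join(marker if is_space_separator(ch) else ch for ch in text)

# U+202F NARROW NO-BREAK SPACE and U+2008 PUNCTUATION SPACE are both Gc=Zs.
print(is_space_separator("\u202f"), is_space_separator("\u2008"))
print(mark_spaces("1\u202f000"))
```

A selector written this way would silently stop matching NNBSP if its Gc were 
changed to Pc, which is the category of characters such as U+005F LOW LINE.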

On the occasion of that Public Review Issue, J. S. Choi reported another use 
case of NNBSP: between an integer and a vulgar fraction, pointing out an error in 
TUS version 8.0 along the way: “the THIN SPACE does not prevent line breaking from 
occurring, which is required in style guides such as the Chicago Manual of 
Style”. ― In version 11.0 the erroneous part is still uncorrected: “If the 
fraction is to be separated from a previous number, then a space can be used, 
choosing the appropriate width (normal, thin, zero width, and so on). For 
example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.”  Note 
that TUS has typeset this with the precomposed U+00BE, not with plain digits 
and fraction slash.
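
The difference between the two spellings can be checked with a short Python 
sketch using the standard unicodedata module. The variable names are only 
illustrative; the NNBSP variant is the sequence suggested in the report, not 
something TUS prescribes.

```python
import unicodedata

# What TUS actually typeset: digit one followed by precomposed U+00BE.
precomposed = "1\u00be"

# What the TUS text describes: 1 + THIN SPACE + 3 + FRACTION SLASH + 4.
with_thin_space = "1\u2009" + "3\u20444"

# The variant from the report: NNBSP (U+202F) instead of THIN SPACE.
with_nnbsp = "1\u202f" + "3\u20444"

# U+00BE has a compatibility decomposition to 3 + U+2044 + 4, so the
# precomposed form carries no space between the integer and the fraction.
decomposed = unicodedata.normalize("NFKD", precomposed)
print([f"U+{ord(c):04X}" for c in decomposed])
print([unicodedata.name(c) for c in with_nnbsp])
```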

If U+2008 PUNCTUATION SPACE is used as intended, changing its line break 
property from BA to GL does not break any implementation or document. As for 
possible misuse of the character in ways other than intended, generally there 
is no point in using as a breakable space a space that is actually just a thin 
variant of U+2007 FIGURE SPACE.

Hence the question, again: Why was PUNCTUATION SPACE not declared as 
non-breakable?

Marcel

That sample also raises concern, as it showcases how much is done or not done, 
as appropriate, to keep NNBSP out of use in Latin script. To what end?
