Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



For web searching, using the math-string 푀푎푦푛푎푟푑 퐾푒푦푛푒푠 as 
the keywords finds John Maynard Keynes in web pages.  Tested this in 
both Google and DuckDuckGo.  Seems like search engines are accommodating 
actual user practices.  This suggests that social media data is possibly 
already being processed for the benefit of the users (and future 
historians) by software people who care about such things.
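One plausible way for a search engine (or any other indexer) to fold such
math-styled strings back to plain letters is compatibility normalization.
A minimal sketch, assuming nothing beyond the Python standard library:

    # Fold mathematical alphanumeric symbols back to their plain ASCII
    # counterparts via NFKC, roughly what an indexer has to do somewhere.
    import unicodedata

    query = "푀푎푦푛푎푟푑 퐾푒푦푛푒푠"
    print(unicodedata.normalize("NFKC", query))  # -> Maynard Keynes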




Loose character-name matching

2019-01-17 Thread J. S. Choi
I’m implementing a Unicode names library. I’m confused about loose 
character-name matching, even after rereading The Unicode Standard § 4.8, UAX 
#34 § 4, and UAX #44 § 5.9.2 – as well as 
[L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), 
[L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and 
the [meeting in which those two items were 
resolved](https://www.unicode.org/L2/L2014/14026.htm).

In particular, I’m confused by the claim in The Unicode Standard § 4.8 saying, 
“Because Unicode character names do not contain any underscore (“_”) 
characters, a common strategy is to replace any hyphen-minus or space in a 
character name by a single “_” when constructing a formal identifier from a 
character name. This strategy automatically results in a syntactically correct 
identifier in most formal languages. Furthermore, such identifiers are 
guaranteed to be unique, because of the special rules for character name 
matching.”

I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.

To make these issues concrete, let’s say that my library provides a function 
called getCharacter that takes a name argument, tries to find a loosely 
matching character, and then returns it (or a null value if there is no 
currently loosely matching character). So then what should the following 
expressions return?

getCharacter(“HANGUL-JUNGSEONG-O-E”)

getCharacter(“HANGUL_JUNGSEONG_O_E”)

getCharacter(“HANGUL_JUNGSEONG_O_E_”)

getCharacter(“HANGUL_JUNGSEONG_O__E”)

getCharacter(“HANGUL_JUNGSEONG_O_-E”)

getCharacter(“HANGUL JUNGSEONGCHARACTERO E”)

getCharacter(“HANGUL JUNGSEONG CHARACTER OE”)

getCharacter(“TIBETAN_LETTER_A”)

getCharacter(“TIBETAN_LETTER__A”)

getCharacter(“TIBETAN_LETTER _A”)

getCharacter(“TIBETAN_LETTER_-A”)
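For concreteness, here is a rough sketch of the loose-matching key I currently 
compute before comparing names; the treatment of underscores and of the 
medial-hyphen edge cases above is exactly the part I am unsure about:

    # Rough sketch of UAX44-LM2 as I currently read it: ignore case, spaces,
    # and underscores; drop medial hyphens, except that the hyphen of
    # "HANGUL JUNGSEONG O-E" (U+1180) is kept so it stays distinct from
    # "HANGUL JUNGSEONG OE" (U+116C). Treating "_" as a space is my assumption.
    import re

    def loose_key(name: str) -> str:
        name = name.upper().replace("_", " ")
        if re.sub(r"\s+", " ", name).strip() == "HANGUL JUNGSEONG O-E":
            return "HANGULJUNGSEONGO-E"              # keep this single hyphen
        name = re.sub(r"(?<=\S)-(?=\S)", "", name)   # remove medial hyphens
        return re.sub(r"\s+", "", name)              # remove remaining spaces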

Thanks,
J. S. Choi



Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 18:35:49 +0100
Marcel Schneider via Unicode  wrote:


> Among the grievances, Unicode is blamed for confusing Greek psili and
> dasia with comma shapes, and for misinterpreting Latin letter forms
> such as the u with descender taken for a turned h, and double u
> mistaken for a turned m, errors that subsequently misled font
> designers to apply misplaced serifs.

And I suppose that the influence was so great that it travelled back in
time, affecting the typography of the Pelican book 'Phonetics' as
reprinted in 1976.

Those IPA characters originated in a tradition where new characters had
been derived by rotating other characters so as to avoid having to have
new type cut.  Misplaced serifs appear to be original.

Richard.



Re: NNBSP

2019-01-17 Thread 梁海 Liang Hai via Unicode
[Just a quick note to everyone that I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]

Best,
梁海 Liang Hai
https://lianghai.github.io

> On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode  
> wrote:
> 
> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>>> [quoted mail]
>>> 
>>> But the French "espace fine insécable" was requested long long before 
>>> Mongolian was discussed for encoding in the UCS. The problem is that the 
>>> initial rush for French was made in a period where Unicode and ISO were 
>>> competing and not in sync, so no agreement could be found, until there was 
>>> a decision to merge the efforts. The early rush was in ISO still not using 
>>> any character model but a glyph model, with little desire to support 
>>> multiple whitespaces; on the Unicode side, there was initially no desire to 
>>> encode all the languages and scripts, focusing initially only on trying to 
>>> unify the existing vendor character sets which were already implemented by 
>>> a limited set of proprietary vendor implementations (notably IBM, 
>>> Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
>>> including the existing ISO 8859-*, GBK, and some national standard or de 
>>> facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well, there was Adobe at this 
>>> time, but still using another unrelated technology). Font standards did not 
>>> yet exist and were competing in incompatible ways; all was a mess at that 
>>> time, so publishers were still required to use proprietary software 
>>> solutions, with very low interoperability (at that time the only "standard" 
>>> was PostScript, not needing any character encoding at all, but only 
>>> encoding glyphs!)
>> 
>> Thank you for this insight. It is a still untold part of the history of 
>> Unicode.
> This historical summary does not square in key points with my own 
> recollection (I was there). I would therefore not rely on it as if gospel 
> truth.
> 
> In particular, one of the key technologies that brought industry partners to 
> cooperate around Unicode was font technology, in particular the development 
> of the TrueType Standard. I find it not credible that no typographers were 
> part of that project :).
> 
> Covering existing character sets (National, International and Industry) was 
> an (not "the") important goal at the time: such coverage was understood as a 
> necessary (although not sufficient) condition that would enable data 
> migration to Unicode as well as enable Unicode-based systems to process and 
> display non-Unicode data (by conversion). 
> 
> The statement: "there was initially no desire to encode all the languages and 
> scripts" is categorically false.
> 
> (Incidentally, Unicode does not "encode languages" - no character encoding 
> does).
> 
> What has some resemblance of truth is that the understanding of how best to 
> encode whitespace evolved over time. For a long time, there was a confusion 
> whether spaces of different width were simply digital representations of 
> various metal blanks used in hot metal typography to lay out text. As the 
> placement of these was largely handled by the typesetter, not the author, it 
> was felt that they would be better modeled by variable spacing applied 
> mechanically during layout, such as applying indents or justification.
> 
> Gradually it became better understood that there was a second use for these: 
> there are situations where some elements of running text have a gap of a 
> specific width between them, such as a figure space, which is better treated 
> like a character under authors or numeric formatting control than something 
> that gets automatically inserted during layout and rendering.
> 
> Other spaces were found best modeled with a minimal width, subject to 
> expansion during layout if needed.
> 
> 
> 
> There is a wide range of typographical quality in printed publication. The 
> late '70s and '80s saw many books published by direct photomechanical 
> reproduction of typescripts. These represent perhaps the bottom end of the 
> quality scale: they did not implement many fine typographical details and 
> their prevalence among technical literature may have impeded the 
> understanding of what character encoding support would be needed for true 
> fine typography. At the same time, Donald Knuth was refining TeX to restore 
> high quality digital typography, initially for mathematics.
> 
> However, TeX did not have an underlying character encoding; it was using a 
> completely different model mediating between source data and final output. 
> (And it did not know anything about typography for other writing systems).
> 
> Therefore, it is not surprising that it took a while and a few false starts 
> to get the encoding model correct for space characters.
> 
> Hopefully, well 

Re: NNBSP

2019-01-17 Thread Asmus Freytag via Unicode

  
  
On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

>> [quoted mail]
>>
>> But the French "espace fine insécable" was requested long long before
>> Mongolian was discussed for encoding in the UCS. The problem is that the
>> initial rush for French was made in a period where Unicode and ISO were
>> competing and not in sync, so no agreement could be found, until there
>> was a decision to merge the efforts. The early rush was in ISO still not
>> using any character model but a glyph model, with little desire to
>> support multiple whitespaces; on the Unicode side, there was initially
>> no desire to encode all the languages and scripts, focusing initially
>> only on trying to unify the existing vendor character sets which were
>> already implemented by a limited set of proprietary vendor
>> implementations (notably IBM, Microsoft, HP, Digital) plus a few of the
>> registered charsets in IANA including the existing ISO 8859-*, GBK, and
>> some national standard or de facto standards (Russia, Thailand, Japan,
>> Korea).
>> This early rush did not involve typographers (well, there was Adobe at
>> this time, but still using another unrelated technology). Font standards
>> did not yet exist and were competing in incompatible ways; all was a
>> mess at that time, so publishers were still required to use proprietary
>> software solutions, with very low interoperability (at that time the
>> only "standard" was PostScript, not needing any character encoding at
>> all, but only encoding glyphs!)
>
> Thank you for this insight. It is a still untold part of the history of
> Unicode.

This historical summary does not square in key points with my own
recollection (I was there). I would therefore not rely on it as if it were
gospel truth.

In particular, one of the key technologies that brought industry partners
to cooperate around Unicode was font technology, in particular the
development of the TrueType Standard. I find it not credible that no
typographers were part of that project :).

Covering existing character sets (National, International and Industry)
was an (not "the") important goal at the time: such coverage was
understood as a necessary (although not sufficient) condition that would
enable data migration to Unicode as well as enable Unicode-based systems
to process and display non-Unicode data (by conversion).

The statement "there was initially no desire to encode all the languages
and scripts" is categorically false.

(Incidentally, Unicode does not "encode languages" - no character encoding
does.)

What has some resemblance of truth is that the understanding of how best
to encode whitespace evolved over time. For a long time, there was
confusion over whether spaces of different width were simply digital
representations of various metal blanks used in hot metal typography to
lay out text. As the placement of these was largely handled by the
typesetter, not the author, it was felt that they would be better modeled
by variable spacing applied mechanically during layout, such as applying
indents or justification.

Gradually it became better understood that there was a second use for
these: there are situations where some elements of running text have a gap
of a specific width between them, such as a figure space, which is better
treated like a character under the author's or numeric formatting control
than something that gets automatically inserted during layout and
rendering.

Other spaces were found best modeled with a minimal width, subject to
expansion during layout if needed.

There is a wide range of typographical quality in printed publications.
The late '70s and '80s saw many books published by direct photomechanical
reproduction of typescripts. These represent perhaps the bottom end of the
quality scale: they did not implement many fine typographical details, and
their prevalence among technical literature may have impeded the
understanding of what character encoding support would be needed for true
fine typography. At the same time, Donald Knuth was refining TeX to
restore high quality digital typography, initially for mathematics.

However, TeX did not have an underlying character encoding; it was using a
completely different model

Re: wws dot org

2019-01-17 Thread Frédéric Grosshans via Unicode

  
  
Thanks for this nice website!

Some feedback:

 * Given the number of scripts in this period, I think that splitting
   10c–19c in two (or even three) would be a good idea.
 * A finer Unicode status would be nice.
 * Coptic is listed as European, while I think it is African (even if a
   member of the LGC (Latin-Greek-Cyrillic) family), since, to my
   knowledge, it has only been used in Africa for African languages
   (Coptic and Old Nubian).
 * Coptic is still used for religious purposes today. Why do you write
   it as dead in the 14th century?
 * Khitan Small Script: according to Wikipedia, it “was invented in
   about 924 or 925 CE”, not 920 (that is the date of the Khitan Large
   Script).
 * Cyrillic: I think its birth date is the 890s, slightly more precise
   than the 10c you write.
 * You include two well-known Tolkienian scripts (Cirth and Tengwar),
   but you omit the third (first?) one, the Sarati (see e.g.
   http://at.mansbjorkman.net/sarati.htm and https://en.wikipedia.org).

On a side note, the site considers Visible Speech a living script, which
surprised me. This information is indeed in the Wikipedia infobox and
implied by its “HMA status” on the Berkeley SEI page, but the text of the
Wikipedia page says “However, although heavily promoted [...] in 1880,
after a period of a dozen years or so in which it was applied to the
education of the deaf, Visible Speech was found to be more cumbersome
[...] compared to other methods, and eventually faded from use.”
My (cursory) research failed to show a more recent date for the system
than this “dozen years or so [past 1880]”. Is there any indication of the
system being used later (say, any date in the 20th century)?

All the best,

   Frédéric

On 15/01/2019 at 19:22, Johannes Bergerhausen via Unicode wrote:

> Dear list,
>
> I am happy to report that www.worldswritingsystems.org is now online.
>
> The web site is a joint venture by
>
> — Institut Designlabor Gutenberg (IDG), Mainz, Germany,
> — Atelier National de Recherche Typographique (ANRT), Nancy, France and
> — Script Encoding Initiative (SEI), Berkeley, USA.
>
> For every known script, we researched and designed a reference glyph.
>
> You can sort these 292 scripts by Time, Region, Name, Unicode version and
> Status. Exactly half of them (146) are already encoded in Unicode.
>
> Here you can find more about the project:
> www.youtube.com/watch?v=CHh2Ww_bdyQ
>
> And here is a link to see the poster:
> https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/
>
> All the best,
> Johannes
>
> ↪ Prof. Bergerhausen
> Hochschule Mainz, School of Design, Germany
> www.designinmainz.de
> www.decodeunicode.org


Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:

> [quoted mail]
>
> But the French "espace fine insécable" was requested long long before Mongolian 
> was discussed for encoding in the UCS. The problem is that the initial rush for French 
> was made in a period where Unicode and ISO were competing and not in sync, so no 
> agreement could be found, until there was a decision to merge the efforts. The early rush 
> was in ISO still not using any character model but a glyph model, with little desire to 
> support multiple whitespaces; on the Unicode side, there was initially no desire to 
> encode all the languages and scripts, focusing initially only on trying to unify the 
> existing vendor character sets which were already implemented by a limited set of 
> proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
> the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national 
> standard or de facto standards (Russia, Thailand, Japan, Korea).
> This early rush did not involve typographers (well, there was Adobe at this time, but still 
> using another unrelated technology). Font standards did not yet exist and were 
> competing in incompatible ways; all was a mess at that time, so publishers were still 
> required to use proprietary software solutions, with very low interoperability (at that 
> time the only "standard" was PostScript, not needing any character encoding at 
> all, but only encoding glyphs!)


Thank you for this insight. It is a still untold part of the history of Unicode.

It seems that there was little incentive to involve typographers because they 
had no computer science training, and because they were feared as trying to 
enforce requirements that Unicode was neither able nor willing to meet, such 
as distinct code points for italics, bold, small caps…

Among the grievances, Unicode is blamed for confusing Greek psili and dasia 
with comma shapes, and for misinterpreting Latin letter forms such as the u 
with descender taken for a turned h, and double u mistaken for a turned m, 
errors that subsequently misled font designers to apply misplaced serifs. 
Things were done in haste and in a hurry, under the Damocles sword of a hostile 
ISO messing around and threatening to unleash an unusable standard if Unicode 
wasn’t quicker.


> If publishers had been involved, they would have revealed that they all needed 
> various whitespaces for correct typography (i.e. layout). Typographers themselves 
> did not care about whitespaces because they had no value for them (no glyph to 
> sell).


Nevertheless the whole range of traditional space forms was admitted, even though 
they were going to be of limited usability. And they were given properties.
Or can’t the misdefinition of PUNCTUATION SPACE be traced back to that era?


> Adobe's publishing software was then completely proprietary (just like Microsoft and others like 
> Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
> us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, and 
> dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained 
> their typists or ad sellers to use it (that character was not "sold" in classified ads; 
> it was necessary for correct layout, notably in narrow columns, and not using it confused the 
> readers, notably around the ":" colon): it had to be non-breaking, non-expanding by justification, 
> narrower than digits and even narrower than the standard non-justified whitespace, and was 
> consistently used as a decimal grouping separator.


No doubt they were confident that when a UCS was set up, such an important 
character wouldn’t be skipped.
So confident that they never guessed that they had a key role to play in 
reviewing, in providing feedback, in lobbying.
Too bad that we’re still so few people today, corporate vetters included, 
even though many things are still going wrong.


> But at that time the most common OSes did not support it natively because there 
> was no vendor charset supporting it (and in fact most OSes were still unable to 
> render proportional fonts everywhere and were frequently limited to 8-bit 
> encodings: DOS, Windows, Unix(es), and even Linux at its early start).


Was there a lack of foresight?
Turns out that today, as those characters are needed, they aren’t ready. Not 
even the NNBSP.

Perhaps it’s the poetic ‘justice of time’ that since Unicode has been around, 
the Vietnamese are the foremost, and the French the hindmost.
[I’m alluding to the early lobbying of Vietnam for a comprehensive set of 
precomposed letters, while French wasn’t even granted the benefit of the 
NNBSP – which according to PRI #308 [1] is today the only known use of NNBSP 
outside Mongolian – and a handful of ordinal indicators (possibly along with 
the rest of the alphabet, except q).]

[1] “The only other widely noted use for U+202F NNBSP is for representation of the 
thin non-breaking space (/espace fine 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 14:36, I wrote:

[…]
The only thing that searches have brought up


It was actually the best thing. Here’s an even more surprising hit:

   B. In the rules, allow these characters to bridge both 
alphabetic and numeric words, with:

 * Replace MidLetter by (MidLetter | MidNumLet)
 * Replace MidNum by (MidNum | MidNumLet)


   -

   4. In addition, the following are also sometimes used, or could 
be used, as numeric separators (we don't give much guidance as to the best 
choice in the standard):

   U+0020 SPACE
   U+00A0 NO-BREAK SPACE
   U+2007 FIGURE SPACE
   U+2008 PUNCTUATION SPACE
   U+2009 THIN SPACE
   U+202F NARROW NO-BREAK SPACE

   If we had good reason to believe that one of these only 
really occurred between digits in a single number, then we could add it. I 
don't have enough information to feel like a proposal for that is warranted, 
but others may. Short of that, we should at least document in the notes that 
some implementations may want to tailor MidNum to add some of these.

I fail to understand what kind of hack is going on. Why didn’t Unicode wish to 
sort out which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence 
and consistency, hence exit…
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.
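
To illustrate the conclusion, a minimal sketch (Python; the helper name is 
mine, not anything normative) of grouping digits with U+202F:

    # Group digits the French way with U+202F NARROW NO-BREAK SPACE, so the
    # groups neither break across lines nor stretch under justification in
    # renderers that honor the property.
    def format_fr(n: int) -> str:
        NNBSP = "\u202F"
        digits = str(abs(n))
        groups = [digits[max(i - 3, 0):i] for i in range(len(digits), 0, -3)]
        return ("-" if n < 0 else "") + NNBSP.join(reversed(groups))

    print(format_fr(1234567))  # 1 234 567, with U+202F between the groups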

CLDR agreed to fix this for French in release 34. In the present Survey 35, 
everything is questioned again, must be assessed, and may impact 
implementations, while all other locales using a space are still impacted by 
bad display using NO-BREAK SPACE.

I know we have another public mail list for that, but I feel it’s important to 
submit this to a larger community for consideration and possibly for feedback.

Thanks.

Regards,

Marcel

P.S. For completeness:

http://unicode.org/L2/L2007/07370-punct.html

And also wrt my previous post:

https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt









Re: Encoding italic (was: A last missing link)

2019-01-17 Thread Philippe Verdy via Unicode
If encoding italics means re-encoding normal linguistic usage, it is a no!
We already have the nightmares caused by the partial encoding of Latin and
Greek (as well as a few Hebrew characters) for maths notations or IPA
notations, but those are restricted to a well delimited scope of use and
subset, and at least they have relevant scientific sources and auditors for
what is needed in serious publications (anyway these subsets may continue to
evolve, but very slowly).
We could have exceptions added for chemical or electrical notations, if
there are standards bodies supporting them.
But for linguistic usage, there's no universal agreement and no single
authority. Characters are added according to common use (by statistical
survey, or because there is some national standard promoting them and
sometimes making their use mandatory with defined meanings, sometimes
legally binding).
For everything else, languages are not constrained and users around the
world invent their own letterforms and styles: there's no limit at all, and
if we start accepting such re-encoding, the situation would in fact be worse
in terms of interoperability, because no one can support zillions of
variants if they are not explicitly encoded separately as surrounding
styles, or scoping characters if needed (using contextual characters,
possibly variant selectors if these variants are most often isolated).
But italics encoded as variant selectors would just pollute everything; and
anyway "italic" is not a single universal convention and does not apply
equally to all scripts. The semantics attached to italic styles also vary
from document to document, the same semantics also have different
typographic conventions depending on authors, and there's no agreed meaning
about the distinctions they encode.
For this reason "italique/oblique/cursive/handwriting..." should remain in
styles (note also that even the italic transform can be variable; it could
also later be a subject of user preferences where people may want to adjust
the degree of slanting according to their reading preferences, or its
orientation if they are left-handed, to match how they write themselves, or
if the writer is a native RTL writer; the context of use (in BiDi) may also
affect this slanting orientation, e.g. inserting some Latin in Arabic could
present the Latin italic letters slanted backward, to better match the
slanting of Arabic itself and avoid collisions of Latin and Arabic glyphs
at BiDi boundaries...).
One can still propose a contextual control character, but it would still be
insufficient for correctly representing the many stylistic variants
possible: we have better languages to do that now, and CSS (or even HTML)
is better for it (including for accessibility requirements: note that
there's no way to translate these italics correctly for Braille readers,
for example; Braille or audio readers attempt to infer a heuristic to
reduce the number of contextual words or symbols they need to insert
between each character, but using VSn characters would complicate that:
they already process the standard HTML/CSS conventions to do that much more
simply).
Direct native encoding of italic characters for linguistic use would fail
if it only covers English: it would worsen the language coverage if people
are then told to remove the essential diacritics common in their language,
only because of the partial coverage of their alphabet.
I don't think this is worth the effort (and it would in fact cause a lot of
maintenance and would severely complicate the addition of new missing
letters; and let's not forget the case of common ligatures, and correct
typographic features like kerning, which would no longer be supported and
would render ugly text if many new kerning pairs are missing in fonts; many
fonts used today would no longer work properly, we would have a reduction
of stylistic options and fewer usable fonts, and we would fall into the
trap of proprietary solutions with a single provider; it would be too
difficult for any font designer to start defining a usable font sellable on
various markets: these fonts would be reduced to niches, and would no
longer find a way to be economically defined and maintained at reasonable
cost).
Consider the problem orthogonally: even if you use CSS/HTML styles in
document encoding (rather than the plain text character encoding), you can
also supply the additional semantics clearly in that document, and also
encode the intent of the author, or supply enough info to permit alternate
renderings (for accessibility, or for technical reasons such as small font
sizes on devices with low resolution, or for people with limited vision
capabilities). The same will apply to color (whose meaning is not clear,
except in specific notations supported by well-known authorities, or by a
long tradition shared by many authors and kept in archives or important
text corpora, such as literature, legal texts, and publications that have
fallen into the public domain after their initial publisher 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:

> [quoted mail]
>
> But the French "espace fine insécable" was requested long long before Mongolian 
> was discussed for encoding in the UCS.


Then we should be able to read its encoding proposal in the UTC document 
registry, but Google Search seems unable to retrieve it, so there is a big risk 
that no such proposal exists, even though the registry goes back to 1990.

The only thing that my searches have brought up is that the part of UAX #14 
that I’ve quoted in the parent thread was added by a Unicode Technical 
Director not mentioned in the author field, and that he did it at the request 
of two gentlemen of whom only the first names are cited. I’m sure their full 
names are Martin J. Dürst and Patrick Andries, but I may be wrong.

I apologize for the comment I made in my e‑mail. Still it would be good to 
learn why the French use of NNBSP is sort of taken with a grain of salt, while 
all involved parties knew that this NNBSP was (as it still is) the only 
Unicode character ever encoded that is able to represent the so-long-asked-for 
“espace fine insécable.”

There is also another question I’ve been asking for a while: why wasn’t the 
character U+2008 PUNCTUATION SPACE given the line break property value "GL" 
like its sibling U+2007 FIGURE SPACE?
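One can check the current values directly against the UCD data file; a small 
sketch (Python, standard library only; the local file path is an assumption):

    # Look up the Line_Break classes of a few space characters in LineBreak.txt.
    WANTED = {0x00A0, 0x2007, 0x2008, 0x2009, 0x202F}

    def line_break_classes(path="LineBreak.txt"):
        found = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                data = line.split("#", 1)[0].strip()
                if not data:
                    continue
                rng, cls = (part.strip() for part in data.split(";"))
                lo, _, hi = rng.partition("..")
                lo, hi = int(lo, 16), int(hi or lo, 16)
                for cp in WANTED:
                    if lo <= cp <= hi:
                        found[cp] = cls
        return found

    for cp, cls in sorted(line_break_classes().items()):
        # expected: GL for NBSP, FIGURE SPACE, NNBSP; BA for U+2008 and U+2009
        print(f"U+{cp:04X}  {cls}")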

This addition to UAX #14 is dated as early as “2007-08-08”. Why was the Core 
Specification not updated in sync, but only 7 years later? And was Unicode 
aware that this whitespace is hated by the industry to such an extent that a 
major vendor denied it support in a major font at a major release of a major OS?

Or did they wait in vain that Martin and Patrick come knocking at their door to 
beg for font support?


Regards,

Marcel


> The problem is that the initial rush for French was made in a period where 
> Unicode and ISO were competing and not in sync, so no agreement could be found, 
> until there was a decision to merge the efforts. The early rush was in ISO 
> still not using any character model but a glyph model, with little desire to 
> support multiple whitespaces; on the Unicode side, there was initially no 
> desire to encode all the languages and scripts, focusing initially only on 
> trying to unify the existing vendor character sets which were already 
> implemented by a limited set of proprietary vendor implementations (notably 
> IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
> including the existing ISO 8859-*, GBK, and some national standard or de facto 
> standards (Russia, Thailand, Japan, Korea).
> This early rush did not involve typographers (well, there was Adobe at this 
> time, but still using another unrelated technology). Font standards did not 
> yet exist and were competing in incompatible ways; all was a mess at that 
> time, so publishers were still required to use proprietary software solutions, 
> with very low interoperability (at that time the only "standard" was 
> PostScript, not needing any character encoding at all, but only encoding 
> glyphs!)
>
> If publishers had been involved, they would have revealed that they all needed 
> various whitespaces for correct typography (i.e. layout). Typographers 
> themselves did not care about whitespaces because they had no value for them 
> (no glyph to sell). Adobe's publishing software was then completely proprietary 
> (just like Microsoft and others like Lotus, WordPerfect...). Years ago I was 
> working for the French press, and they absolutely required us to manage the 
> [FINE] for use in newspapers, classified ads, articles, guides, phone books, 
> and dictionaries. It was even mandatory to enter these [FINE] in the composed 
> text, and they trained their typists or ad sellers to use it (that character 
> was not "sold" in classified ads; it was necessary for correct layout, notably 
> in narrow columns, and not using it confused the readers, notably around the 
> ":" colon): it had to be non-breaking, non-expanding by justification, narrower 
> than digits and even narrower than the standard non-justified whitespace, and 
> was consistently used as a decimal grouping separator.
>
> But at that time the most common OSes did not support it natively because 
> there was no vendor charset supporting it (and in fact most OSes were still 
> unable to render proportional fonts everywhere and were frequently limited to 
> 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early start). 
> So an intermediate solution was needed. The US chose not to use the 
> non-breaking thin space at all because in English it was not needed for basic 
> Latin, but also because of the huge prevalence of 7-bit ASCII for everything 
> (but including its own national symbol for the "$", competing with other 
> ISO 646 variants). There were tons of legacy applications developed over 
> decades that did not support anything else, and interoperability in the US was 
> available only with ASCII; everything else was unreliable.
>
> If you remember the early years when the Internet started to develop outside 
> the US, you remember the 

Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



On 2019-01-17 11:50 AM, Martin J. Dürst wrote:

> Most probably not. I think Asmus has already alluded to it, but in good
> typography, roman and italic fonts are considered separate.

So are Latin and Cyrillic fonts.  So are American English and Polish 
fonts, for that matter, even though they're both Latin-based.  Times New 
Roman and Times New Roman Italic might be two separate font /files/ on 
computers, but they are the same typeface.


The point I was trying to make WRT 256-glyph fonts is that they pre-date 
Unicode and I believe much of the "layering" is based on artifacts from 
that era.


Lead fonts were glyph based.  The technical concept of character came later.



Re: Encoding italic

2019-01-17 Thread Martin J . Dürst via Unicode
On 2019/01/17 17:51, James Kass via Unicode wrote:
> 
> On 2019-01-17 6:27 AM, Martin J. Dürst replied:

>  > ...
>  > Based by these data points, and knowing many of the people involved, my
>  > description would be that decisions about what to encode as characters
>  > (plain text) and what to deal with on a higher layer (rich text) were
>  > taken with a wide and deep background, in a gradually forming industry
>  > consensus.
> 
> (IMO) All of which had to deal with the existing font size limitations 
> of 256 characters and the need to reserve many of those for other 
> textual symbols as well as box drawing characters.  Cause and effect. 
> The computer fonts weren't designed that way *because* there was a 
> technical notion to create "layers".  It's the other way around.  (If 
> I'm not mistaken.)

Most probably not. I think Asmus has already alluded to it, but in good 
typography, roman and italic fonts are considered separate. They are 
often used in sets, but it's not impossible e.g. to cut a new italic to 
an existing roman or the other way round. This predates any 8-bit/256 
characters limitations. Also, Unicode from the start knew that it had to 
deal with more than 256 characters, not only for East Asia, and so I 
don't think such size limits were a major issue when designing Unicode.

On the other hand, the idea that all Unicode characters (or a 
significant and as yet undetermined subset of them) would need 
italic, ... variants definitely will have led to shooting down such 
ideas, in particular because Unicode started as strictly 16-bit.

Regards,   Martin.



Re: NNBSP

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 09:58, Richard Wordingham wrote:


On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:


Also, at least one French typographer was extremely upset
about Unicode not gathering feedback from typographers.
That blame is partly wrong since at least one typographer
was and still is present in WG2, and even if not being a
Frenchman (but knowing French), as an Anglophone he might
have been aware of the most outstanding use case of NNBSP
with English (both British and American) quotation marks
when a nested quotation starts or ends a quotation, where
_‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
least with proportional fonts.


There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.


Thanks, I didn’t know that this is already implemented. Sometimes one can
read in discussions that the issue is dismissed to the font level. That always
looked utopian to me, all the more as people trained on typewriters are used
to typing spaces, and I always believed that it was a way for helpless
keyboard layout designers to hand the job over.

Turns out there is more to it. But the high-end solution notwithstanding,
the use of an extra space character is recommended practice:

https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_
recommends a thin space, whereas _The Gregg Reference Manual_ promotes a
full space between the quotation marks. _The Chicago Manual of Style_ says
no space is necessary but adds that a space or a thin space can be inserted
as ‘a typographical nicety.’ ” The author cites three other manuals in which
she could not find any guidance on the topic.

We note that all three style guides seem completely unconcerned with
non-breakability. Not so the author of the blog post: “[…] If your software
moves the double quotation mark to the next line of type, use a nonbreaking
space between the two marks to keep them together.” Certainly she would
recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard
or if the software provided a handy shortcut by default.
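
A trivial sketch of that “typographical nicety” (a hypothetical helper; I pick
U+202F here, while the style guides above would use a THIN SPACE or a full
no-break space):

    import re

    # Insert a NARROW NO-BREAK SPACE between adjacent single and double
    # quotation marks so the pair neither collides nor wraps apart.
    def space_nested_quotes(text: str) -> str:
        return re.sub(r"(?<=[‘’])(?=[“”])|(?<=[“”])(?=[‘’])", "\u202F", text)

    print(space_nested_quotes("“He said ‘no.’”"))  # inserts U+202F between ’ and ”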



This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)


Another drawback is that most environments don’t provide OpenType support,
that the whole scheme depends on language tags that could easily get lost,
and that the issue, being particular to French, would quickly lead to support
being dismissed as not cost-effective, arguing that *if* some individual
locale has special requirements for punctuation layout, its writers are
welcome to pick an appropriate space from the UCS and key it in as desired.

The same is also observed about Mongolian. Today, the preferred approach
for appending suffixes is to encode a Mongolian Suffix Connector to make
sure the renderer will use correct shaping, and to leave the space to the
writer’s discretion. That indeed looks much better than imposing a hard
space that proved cumbersome in practice, and that is reported to often
get in the way of a usable text layout.

The problems related to NNBSP as encountered in Mongolian are completely
absent when NNBSP is used with French punctuation or as the regular
group separator in numbers. Hence I’m sure that everybody on this List
agrees in discouraging changes made to the character properties of NNBSP,
such as switching the line breaking class (as "GL" is non-tailorable), or
changing general category to Cf, which could be detrimental to French.

However we need to admit that NNBSP is basically not a Latin but a
Mongolian space, despite being readily attracted into Western typography.
A similar disturbance takes place in word processors, where except in
Microsoft Word 2013, the NBSP is not justifying as intended and as it is
on the web. It’s being hacked and hijacked despite being a bad compromise,
for the purpose of French punctuation spacing. That tailoring is in turn
very detrimental to Polish users, among others, who need a justifying
no-break space for the purpose of prepending one-letter prepositions.

Fortunately a Polish user found and shared a workaround using the string
, the latter being still used in lieu of WORD JOINER as
long as Word keeps unsupporting latest TUS (an issue that raised concern
at Microsoft when it was reported, and will probably be fixed or has
already been fixed meanwhile).



Another spacing mess occurs 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Philippe Verdy via Unicode
Le jeu. 17 janv. 2019 à 05:01, Marcel Schneider via Unicode <
unicode@unicode.org> a écrit :

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode  wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode about this
> >> problem, blame the renderer (or the application or OS using it, may
> >> be they are very outdated and were not aware of these features, they
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present even if it was requested since very long at least for
> >> French and even English, before even Unicode, and long before
> >> Mongolian was then encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are completely script-neutral, they are generic, and are even
> >> BiDi-neutral, they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" was requested long long before
Mongolian was discussed for encoding in the UCS. The problem is that the
initial rush for French was made in a period where Unicode and ISO were
competing and not in sync, so no agreement could be found, until there was
a decision to merge the efforts. The early rush was in ISO still not using
any character model but a glyph model, with little desire to support
multiple whitespaces; on the Unicode side, there was initially no desire to
encode all the languages and scripts, focusing initially only on trying to
unify the existing vendor character sets which were already implemented by
a limited set of proprietary vendor implementations (notably IBM,
Microsoft, HP, Digital) plus a few of the registered charsets in IANA
including the existing ISO 8859-*, GBK, and some national standard or de
facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this
time, but still using another unrelated technology). Font standards did not
yet exist and were competing in incompatible ways; all was a mess
at that time, so publishers were still required to use proprietary software
solutions, with very low interoperability (at that time the only "standard"
was PostScript, not needing any character encoding at all, but only
encoding glyphs!)

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Typographers
themselves did not care about whitespaces because they had no value for
them (no glyph to sell). Adobe's publishing software was then completely
proprietary (just like Microsoft and others like Lotus, WordPerfect...).
Years ago I was working for the French press, and they absolutely required
us to manage the [FINE] for use in newspapers, classified ads, articles,
guides, phone books, and dictionaries. It was even mandatory to enter these
[FINE] in the composed text, and they trained their typists or ad sellers
to use it (that character was not "sold" in classified ads; it was
necessary for correct layout, notably in narrow columns, and not using it
confused the readers, notably around the ":" colon): it had to be
non-breaking, non-expanding by justification, narrower than digits and even
narrower than the standard non-justified whitespace, and was consistently
used as a decimal grouping separator.
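
For the record, that composition rule can be sketched mechanically; a
hypothetical Python helper (the inventory of punctuation treated here follows
common French practice, not any normative list):

    import re

    NNBSP = "\u202F"  # the encoded heir of the typographic [FINE]

    # Put a narrow no-break space before high punctuation and inside
    # guillemets, and use it as the digit-group separator.
    def fine_insecable(text: str) -> str:
        text = re.sub(r"\s*([;:!?»])", NNBSP + r"\1", text)  # before ; : ! ? »
        text = re.sub(r"(«)\s*", r"\1" + NNBSP, text)        # after «
        return re.sub(r"(?<=\d) (?=\d{3}\b)", NNBSP, text)   # group digits

    print(fine_insecable("Prix : 10 000 francs ! « Vraiment ? »"))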

But at that time the most common OSes did not support it natively because
there was no vendor charset supporting it (and in fact most OSes were still
unable to render proportional fonts everywhere and were frequently limited
to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breaking thin space at all because in English it was not needed for
basic Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (but including its own national symbol for the "$", competing
with other ISO 646 variants). There were tons of legacy applications
developed over decades that did not support anything else, and
interoperability in the US was available only with ASCII; everything else
was unreliable.

If you remember the early years when the Internet started to develop
outside US, you remember the nightmare of non-interoperable 8-bit charsets
and the famous "mojibake" we saw everywhere. Then the competition between
ISO and Unicode lasted too long. But it was considered "too late" for
French to change anything (and Windows used 

Re: Encoding italic (was: A last missing link)

2019-01-17 Thread Victor Gaultney via Unicode

Andrew Cunningham wrote:
Underlining, bold text, interletter spacing, colour change, font style 
change are all used to apply meaning in various ways. Not sure why 
italic is special in this sense.


Italic is uniquely different from these in that the meaning has been 
well-established in our writing system for centuries, and is 
consistently applied. The only one close to being this is use of 
interletter spacing for distinction, particularly in the German and 
Czech tradition. Of course, a model that can encode span-like features 
such as italic could then support other types of distinction. However 
the meaning of that distinction within the writing system must be clear. 
IOW people do use colour change to add meaning, but the meaning is not 
consistent, and so preserving it in plain text is relatively pointless. 
Even Bold doesn't have a consistent meaning other than  - but 
that's a separate conversation.


And I am curious about your thoughts: if we distinguish italic in 
Unicode and encode some way of specifying italic text, wouldn't it make 
more sense to do away with italic fonts altogether, and just roll 
the italic glyphs into the regular font?


That's actually being done now. OpenType variation fonts allow a variety 
of styles within a single 'font', although I personally feel using that 
for italic is misguided.


The reality is that the most commonly used Latin fonts - OS core ones - 
all have italic counterparts, so app creators only have to switch to 
using that counterpart for that span. And if the font has no italic 
counterpart then a fallback mechanism can kick in - just like is done 
when a font doesn't have a glyph to represent a character.


In theory, changing italic from a stylistic choice, as it currently is, 
to an encoding/character-level semantic is a paradigm shift.


Yes it would be - but it could be a beneficial shift, and one that more 
completely reflects distinctions in the Latin script that go back over 
400 years.


But if it were introduced I would prefer a system that was more 
inclusive of all scripts, giving proper analysis of typesetting and 
typographic conventions in each script and well-founded decisions on 
which should be encoded. Cherry-picking one feature relevant to a 
small set of scripts seems to be a problematic path.


The core issue here is not really italic in Latin - that's only one 
case. An adjusted text model that supports span-like text features, 
could also unlock benefits for other scripts that have consistent 
span-like features.






Re: Encoding italic (was: A last missing link)

2019-01-17 Thread Victor Gaultney via Unicode
( I appreciate that UTC meetings are going on - I too will be traveling 
a bit over the next couple of weeks, so may not respond quickly. )


Support for marking 'italic' in plain text - however it's done - would 
certainly require changes in text processing. That would also be the 
case for some of the other span-like issues others have mentioned. 
However a clear model for how to handle that could solve all the issues 
at once. Italic would only be one application of that model, and only 
applicable to certain scripts. Other scripts might have parallel issues. 
BTW - I'm speaking only about span-like things that encode content, not 
the additional level of rich-text presentation.


If however, we say that this "does not adequately consider the harm done 
to the text-processing model that underlies Unicode", then that exposes 
a weakness in that model. That may be a weakness that we have to accept 
for a variety of reasons (technical difficulty, burden on developers, UI 
impact, cost, maturity).


We then have to honestly admit that the current model cannot always 
unambiguously encode text content in English and many other languages. 
It is impossible to express Crystal's distinction between 'red slippers' 
and '/red/ slippers' in plain text without using other characters in 
non-standardized ways. Here I am using my favourite technique for this - 
/slashes/.


There are other uses of italic that indicate difference in actual 
meaning, many that go back centuries, and for which other span-like 
punctuation like quotes aren't used. Examples:


- Titles of books, films, compositions, works of art: 'Daredevil' - the 
Marvel comics character vs. '/Daredevil/' - the Netflix series.


- Internal voice, such as a character's private thoughts within a 
narrative: 'She pulled out a knife. /What are you doing? How did you 
find out.../'


- Change of author/speaker, as in editorial comments: '/The following 
should be considered.../'


- Heavy stress in speech, which is different than Crystal's distinction: 
'Come here /this instant/'


- Examples: 'The phrase /I could care less/...' (quotes are sometimes 
used for this one)


Is it important to preserve these distinctions in plain text? The text 
seems 'readable' without them, but that requires some knowledge of 
context. And without some sort of other marking, as I've done, some of 
the meaning is lost. This is why italics within text have always been 
considered an editorial decision, not a typesetting one.


In a similar way, we really don't need to include diacritics when 
encoding French. In all but a few rare cases, French is perfectly 
'readable' without accents - the content can usually be inferred from 
context. Yet we would never consider unaccented French to be correct.


More evidence for italics as an important element within encoded text 
comes from current use. A couple of years ago I collected every tweet 
that referred to italics for a month. People frequently complained that 
they were not able to express themselves fully without italics, and 
resorted to 40 different techniques to try and mark words and phrases as 
'italic'.


In the current model, plain text cannot fully preserve important 
distinctions in content. Maybe we just need to admit and accept that. 
But maybe an enhancement to the text processing model would enable more 
complete encoding of content, both for italics in Latin script and for 
other features in other scripts.


As for how the UIs of the world would need to change: Until there is a 
way to encode italic in plain text there's no motivation for people to 
even experiment and innovate.






Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

Courier New was lacking NNBSP on Windows 7. It does include it on
Windows 10. The tests I referred to were made 2 years ago. I
confess that I was so disappointed to see Courier New not supporting
NNBSP a decade after its encoding, while many relevant people in the
industry were surely aware of its role and importance for French
(at least those keeping a branch office in France), that I gave it
up. Turns out that foundries delay support until the usage
is backed by TUS, which happened in 2014, in time for Windows 10.
(I’m lacking hints about Windows 8 and 8.1.)

Superscripts are a handy parallel showcasing a similar process.
As long as preformatted superscripts are outlawed by TUS for use
in the digital representation of abbreviation indicators, vendors
keep disturbing their glyphs with what one could start calling an
intentional metrics disorder (IMD). One can also rank the vendors
on the basis of the intensity of IMD in preformatted superscripts,
but this is not the appropriate thread, and anyhow this List is
not the place. A comment on CLDR ticket #11653 is better.

[…]

Due to the way  made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New not supporting NNBSP. […]
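
For anyone wanting to repeat that kind of survey, a short sketch using the
fontTools library (the font paths are placeholders, not a recommendation):

    # Check which font files map a glyph for U+202F NARROW NO-BREAK SPACE,
    # using fontTools (pip install fonttools).
    from fontTools.ttLib import TTFont

    FONTS = [r"C:\Windows\Fonts\cour.ttf", r"C:\Windows\Fonts\arial.ttf"]

    for path in FONTS:
        cmap = TTFont(path)["cmap"].getBestCmap()
        status = "has" if 0x202F in cmap else "lacks"
        print(f"{path}: {status} a glyph for U+202F")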

Marcel


Re: NNBSP

2019-01-17 Thread Richard Wordingham via Unicode
On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:

> Also, at least one French typographer was extremely upset
> about Unicode not gathering feedback from typographers.
> That blame is partly wrong since at least one typographer
> was and still is present in WG2, and even if not being a
> Frenchman (but knowing French), as an Anglophone he might
> have been aware of the most outstanding use case of NNBSP
> with English (both British and American) quotation marks
> when a nested quotation starts or ends a quotation, where
> _‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
> unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
> least with proportional fonts.

There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.

This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)

Another spacing mess occurs with the Thai repetition mark U+0E46 THAI
CHARACTER MAIYAMOK, which is supposed to be separated from the
duplicated word by a space.  I'm not sure whether this space should
expand for justification any more often than inter-letter spacing. Some
fonts have taken to including the preceding space in the character's
glyph, which messes up interoperability.  An explicit space looks ugly
when the font includes the space in the repetition mark, and the lack of
an explicit space looks illiterate when the font excludes the leading
space.

Richard.



Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



On 2019-01-17 6:27 AM, Martin J. Dürst replied:

> ...
> So even if you can find examples where the presence or absence of
> styling clearly makes a semantic difference, this may or will not be
> enough. It's only when it's often or overwhelmingly (as opposed to
> occasionally) the case that a styling difference makes a semantic
> difference that this would start to become a real argument for plain
> text encoding of italics (or other styling information).

(also from PDF chapter 2,)
"Plain text is public, standardized, and universally readable."
The UCS is universal, which implies that even edge cases, such as failed 
or experimental historical orthographies, are preserved in plain text.


> ...
> I think most Unicode specialists have chosen to ignore this thread by
> this point.

Those not switched off by the thread title may well be exhausted and 
pressed for time because of the UTC meeting.


> ...
> Based by these data points, and knowing many of the people involved, my
> description would be that decisions about what to encode as characters
> (plain text) and what to deal with on a higher layer (rich text) were
> taken with a wide and deep background, in a gradually forming industry
> consensus.

(IMO) All of which had to deal with the existing font size limitations 
of 256 characters and the need to reserve many of those for other 
textual symbols as well as box drawing characters.  Cause and effect.  
The computer fonts weren't designed that way *because* there was a 
technical notion to create "layers".  It's the other way around.  (If 
I'm not mistaken.)


>> ..."Jackie Brown"...
> ...
> Also, for probably at least 90% of
> the readership, the style distinction alone wouldn't induce a semantic
> distinction, because most of the readers are not familiar with these
> conventions.

Proper spelling and punctuation seem to be dwindling in popularity, as 
well.  There's a percentage unable to make a semantic distinction 
between 'your' and 'you’re'.


> (If you doubt that, please go out on the street and ask people what
> italics are used for, and count how many of them mention film titles or
> ship names.)

Or the em-dash, en-dash, Mandaic letter ash, or Gurmukhi sign yakash.  
Sure, most street people have other interests.


> (And just while we are at it, it would still not be clear which of
> several potential people named "Jackie Brown" or "Thorstein Veblen"
> would be meant.)

Isn't that outside the scope of italics?  (winks)