Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 09:42, Asmus Freytag via Unicode wrote:

[…]

For one, many worthwhile additions / changes to Unicode depend on getting written up in 
proposal form and then championed by dedicated people willing to see through the process. 
Usually, Unicode has so many proposals to pick from that at each point there are more 
than can be immediately accommodated. There's no automatic response to even issues that 
are "known" to many people.

"Demands" don't mean a thing, formal proposals, presented and then refined 
based on feedback from the committee is what puts issues on the track of being resolved.


That is also what I suspected, that the French were not eager enough to get 
French supported, as opposed to the Vietnamese who lobbied long before the era 
of proposals and UTC meetings.

Please, where can we find the proposals for FIGURE SPACE to become 
non-breakable, and for PUNCTUATION SPACE to stay or become breakable?

(That is not a rhetorical question. The ideal answer is a URL.
Also, this is not about pre-Unicode documentation, but about the action that 
Unicode took in that era.)


[…]

Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, 
but I don't remember using proportional spacing, although I've seen it in the kinds of 
"typescript" books I mentioned. Some had that crude approximation of 
typesetting.


Thanks for reporting.


When Unicode came out, that was no longer the state of the art as TeX and laser 
printers weren't limited that way.

However, the character sets from which Unicode was assembled (or which it had 
to match, effectively) were designed earlier - during those times. And we 
inherited some things (that needed to be supported so round-trip mapping of 
data was possible) but that weren't as well documented in their particulars.

I'm sure we'll eventually deprecate some and clean up others, like the 
Mongolian encoding (which also included some stuff that was encoded with an 
understanding that turned out less solid in retrospect than we had thought at 
the time).

Something the UTC tries very hard to avoid, but nobody is perfect. It's best 
therefore to try not to ascribe non-technical motives to any action or inaction 
of the UTC. What outsiders see is rarely what actually went down,


That is because the meeting minutes would benefit from being more explicit.


and the real reasons for things tend to be much less interesting from an 
interpersonal or intercultural perspective.


I don’t care about “interesting” reasons. I’d just appreciate knowing the truth.


So best avoid that kind of topic altogether and never use it as basis for 
unfounded recriminations.


When you ask to know the foundations and that knowledge is persistently 
withheld, you end up believing that those foundations simply can’t be told.

Note, too, that I readily ceased blaming UTC and shifted the blame elsewhere, 
where it actually belongs. I’d kindly request not to be considered a 
hypocrite who in reality keeps blaming the UTC.


A./





Re: NNBSP

2019-01-19 Thread Marcel Schneider via Unicode

On 19/01/2019 01:21, Shawn Steele wrote:


>> If they are obsolete apps, they don’t use CLDR / ICU, as these are 
designed for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a 
little narrow-minded.

>> I didn’t look into these data interchanges but I suspect they won’t use any 
thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/


Thanks for sharing. As it happens, what I like most is the first reason you provide:

 * “The most obvious reason is that there is a bug in the data and we had to 
make a change. (Believe it or not we make mistakes ;-))  In this case our users 
(and yours too) want culturally correct data, so we have to fix the bug even if 
it breaks existing applications.”


No comment :)


>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.


No problem.


>> What are all these expected to do while localized with scripts outside 
Windows code pages?

(We call those “unicode-only” locales FWIW)


Noted.


The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).



Like here:
https://blogs.msdn.microsoft.com/shawnste/2009/06/01/writing-fields-of-data-to-an-encoded-file/

You’re showcasing that despite “The moral here is ‘Use Unicode’ ” some people 
are still not using it. It gets even weirder when you state that code pages 
and Unicode are not 1:1, contradicting the Unicode design principle of 
round-trip compatibility.

The point of not using Unicode, and likewise of not using verbose formats, is 
limited hardware resources. Often new implementations are built on top of old 
machines and programs, for example in the energy and shipping industries. This 
poses a security threat, ending up in power outages and logistics breakdowns. 
That makes our democracies vulnerable. Hence maintaining obsolete systems 
does not pay off. We’re all better off recycling all the old hardware and 
investing in the latest technologies, which implement Unicode along the way.

What you are advocating in this thread seems like a non-starter.


However, the fact that an app may run very poorly in Cherokee or whatever 
doesn’t mean that there aren’t a bunch of French enterprises that depend on 
that app for their day-to-day business.


They’re ill-advised in doing so (see above).


In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.


To be “selected”, not developed and built. The job is already done. What are 
people waiting for?


However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.


They may not see it because they lack appropriate training in cyber 
security. You seem to be backing that unresponsive behavior. I can’t see how 
you may be doing any good that way, and I’d strongly advise you to reach out 
to your customers, or to check the issue with your managers. We’re in a time where 
companies are still making huge profits, and it is unclear where all that 
money goes once paid out to shareholders. The money is there, you only need to 
market the security. That job would be a better use of your time than tampering 
with legacy apps.


Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.


Am I pestering CLDR…

Keeping CLDR in sync is just the right way to go.

While we’re at it: do you have any hints about why some powerful UTC members 
seem to hate NNBSP in French?
I’m mainly talking about French punctuation spacing here.


Which is why I suggested an “opt-in” alt form that apps wanting “civilized” 
behavior could opt-into (at least for long enough that enough badly behaved 
apps would be updated to warrant moving that to the default.)



Asmus Freytag’s proposal seems better:

   “having information on "common fallbacks" would be useful. If formatting 
numbers, I may be free to pick the "best",
   but when parsing for numbers I may want to know what deviations from 
"best" practice I can expect.”


Because if you let your customers “opt in” instead of urging them to update, 
some will never opt in, given they’re not even ready to care about cyber 
security.


The data for 

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:

On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:

On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:


Marcel,

about your many detailed *technical* questions about the history of character 
properties, I am afraid I have no specific recollection.


Other List Members are welcome to join in, many of whom are aware of how things 
happened. My questions are meant to be rather simple. Summing up the main 
ones:

 1. Why does UTC ignore the need of a non-breakable thin space?
 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?

A less important piece of information would be how extensively typewriters with 
proportional advance widths were used to produce books ready for print.

Another question you do answer below:


French is not the only language that uses a space to group figures. In fact, I 
grew up with thousands separators being spaces, but in much of the existing 
publications or documents there was certainly a full (ordinary) space being 
used. Not surprisingly, because in those years documents were typewritten and 
even many books were simply reproduced from typescript.

When it comes to figures, there are two different types of spaces.

One is a space that has the same width as a digit and is used in the layout of lists. For example, if 
you have a leading currency symbol, you may want to have that lined up on the left and leave the 
digits representing the amounts "ragged". You would fill the intervening spaces with this 
"lining" space character and everything lines up.


That is exactly how I understood hot-metal typesetting of tables. What 
surprises me is why computerized layout works the same way instead of using 
tabulation and appropriate tab stops (left, right, centered, decimal, with all 
decimal separators lining up vertically).


==> At the time Unicode was first created (and definitely before that, during the time of 
non-universal character sets) many applications existed that used a "typewriter 
model" and worked by space fill rather than decimal-point tabulation.


If you are talking about applications, as opposed to typesetting tables for 
book printing, then I’d suggest that the fixed-width display of tables could be 
done much like today’s source-code layout, where the normal space is used for 
that purpose. In this use case, line wrap is typically turned off. That could 
make non-breakable spaces sort of pointless (but I’m aware of your point 
below), except if people are expected to re-use the data in other environments. 
In that case, best practice is to use NNBSP as the thousands separator while 
displaying it like other monospace characters. That’s at least how today’s 
monospace fonts work (provided they’re used in environments actually supporting 
Unicode, which may not happen with applications running in a terminal).


From today's perspective that older model is inflexible and not the best 
approach, but it is impossible to say how long this legacy approach hung on in 
some places and how much data might exist that relied on certain long-standing 
behaviors of these space characters.


My position for some time has been that legacy apps should use legacy libraries. 
But I’ll come back to this when responding to Shawn Steele.


For a good solution, you always need to understand

(1) the requirement of your "index" case (French, in this case)


That’s okay.


(2) how it relates to similar requirements in (all!) other languages / scripts


That’s rather up to CLDR as I suggested, given it has the means to submit a 
point to all vetters. See again below (in the part that you’ve cut off without 
consideration).


(3) how it relates to actual legacy practice


That’s Shawn Steele’s point (see next reply).


(3a) what will suddenly no longer work if you change the properties on some 
character

(3b) what older data will no longer work if the effective behavior of newer 
applications changes


I’ll note already that this requires awareness of actual use cases and/or 
delving into the OSes, which is far beyond what I can currently do, both in 
time and in resources. The vetter’s role is to inform CLDR with correct data from 
their locale. CLDR is then welcome to sort things out and to get in touch with 
the industry, which CLDR TC is actually doing. But that has no impact on the 
data submitted at survey time. Changing votes to say “OK, let the group 
separator be NBSP as long as…” would be a lie.



In lists like that, you can get away with not using a narrow thousands 
separator, because the overall context of the list indicates which digits 
belong together and form a number. Having a narrow space may still look nicer, 
but complicates the space fill between the symbol and the digits.


It does not, provided that all numbers have thousands separators, even if 
filling with spaces. It looks nicer because it’s more legible.
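As an aside, here is a minimal Python sketch of the “lining space” fill described above (my own illustration, assuming a font in which U+2007 FIGURE SPACE really is digit-wide):

    amounts = ["1234,56", "987,65", "12,00"]
    width = max(len(a) for a in amounts)
    for a in amounts:
        # currency symbol lined up on the left, FIGURE SPACE fill, digits flush right
        print("$" + "\u2007" * (width - len(a)) + a)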


Now for numbers in running text using an o

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 23:46, Shawn Steele wrote:


>> Keeping these applications outdated has no other benefit than providing a 
handy lobbying tool against support of NNBSP.

I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).


If they are obsolete apps, they don’t use CLDR / ICU, as these are designed for 
up-to-date and fully localized apps. So one hassle is off the table.


Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.


I didn’t look into these data interchanges, but I suspect they won’t use any 
thousands separator at all to interchange data. The group separator is only for 
display and print, and there you may wish to use a compat library for obsolete 
apps and the newest library for apps with Unicode support. If an app is that 
obsolete, it will keep working without new data from ICU.
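A minimal sketch of that division of labour, assuming Python and the third-party Babel library (which bundles CLDR data; whether fr_FR grouping comes out as NBSP or NNBSP depends on the CLDR version it ships):

    import json
    from babel.numbers import format_decimal   # Babel bundles CLDR locale data

    amount = 1234567
    payload = json.dumps({"amount": amount})    # interchanged raw, no group separator

    # the group separator only appears at display time, taken from the locale data
    print(format_decimal(amount, locale="fr_FR"))   # e.g. '1 234 567' (NBSP or NNBSP grouping)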


This could be something a “new” app running with the latest & greatest locale 
data and trying to import the legacy data users had saved on that app.  Or 
exchanging data with an application using the system settings which are perhaps 
older.


Again, I don’t believe that apps are storing numbers with thousands separators 
in them. Not even spreadsheet software does that. I say “not even” because 
these are high-end apps where the latest locale data is expected.

Sorry you did skip this one:

>> What are all these expected to do while localized with scripts outside 
Windows code pages?

Indeed, that is the paradox: Tirhuta users are entitled to correct 
display with the newest data, while Latin users are bothered indefinitely with old 
data and legacy display.


>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.


Not the user. I’m addressing your concerns as coming from the developer side. I 
meant you should use the data as appropriate, and if a character is beyond 
support, just replace it for convenience.


>> That should not impact all other users out there interested in a civilized 
layout.

I’m not sure that the choice of the word “civilized” adds value to the 
conversation.


That is to express in a mouthful of English what user feedback is, or can be, 
even if not all the time. Users complain about quotation marks spaced 
off too far when typeset with NBSP, as Word does. It’s really ugly, they say. 
NBSP is a character with a precise usage; it’s not one-size-fits-all. By the way, 
since you’re in the business, why does Word not provide an option with a checkbox 
letting the user set the space as desired, NBSP or NNBSP?


  We have pretty much zero feedback that the OS’s French formatting is 
“uncivilized” or that the NNBSP is required for correct support.


That is, at some point users stop submitting feedback when they see how 
little use it is to spend time posting it. From the “pretty much zero” you may 
wish to pick the one or two reports you do get, guessing that for each one there are 
a thousand other users out there with the same feedback who aren’t submitting 
it. One thousand or one million, it’s hard to be precise…


>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
for.

For compatibility, I’d actually much prefer that CLDR have an alt “best 
practice” field that maintained the existing U+00A0 behavior for compatibility, 
yet allowed applications wanting the newer typographic experience to opt-in to 
the “best practice” alternative data.  As applications became used to the idea 
of an alternative for U+00A0, then maybe that could be flip-flopped and put 
U+00A0 into a “legacy” alt form in a few years.


You don’t need that field in CLDR. Here’s how it works: take the locale data, 
search-and-replace all NNBSP with NBSP, and that’s the library you’ll use.
Because NNBSP is not only in the group separator, I’d suggest downloading 
common/main/fr.xml and checking all instances of NNBSP. The legacy apps you’re 
referring to certainly don’t use that data. That data is for fine high-end apps 
and for the user interfaces of Windows and any other OS. If you want your employer 
to be well served, you’d rather have the correct data, not legacy fallbacks.
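A minimal sketch of such a downgrading pass, assuming Python and a local copy of CLDR’s common/main/fr.xml (the file path and the NNBSP-to-NBSP substitution are the only assumptions here):

    from pathlib import Path

    data = Path("common/main/fr.xml").read_text(encoding="utf-8")
    # replace every NARROW NO-BREAK SPACE (U+202F) with NO-BREAK SPACE (U+00A0)
    Path("common/main/fr-legacy.xml").write_text(
        data.replace("\u202f", "\u00a0"), encoding="utf-8")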


Normally I’m all for having the “best” data in CLDR, and there are many locales 
that have data with limited support for whatever reasons.  U+00A0 is pretty 
exceptional in my view though, developers have been hard-coding dependencies on 
that value for ½ a century without even realizing there might be other types of 
non-breaking spaces.  Sure, that’s not really the best practice, particularly 
in 

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 22:03, Shawn Steele via Unicode wrote:


I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out 
that part of the reason NBSP has been used for thousands separators is that 
it exists in all of those legacy codepages that were mentioned predating 
Unicode.

Whether or not NNBSP provides a better typographical experience, there are a 
lot of legacy applications, and even web services, that depend on legacy 
codepages.  NNBSP may be best for layout, but I doubt that making it work 
perfectly for thousands separators is going to be some sort of magic bullet that 
solves any of the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).

13/01/2019  09:48    15?360 AcXtrnal.dll

13/01/2019  09:46    54?784 AdaptiveCards.dll

13/01/2019  09:46    67?584 AddressParser.dll

13/01/2019  09:47    24?064 adhapi.dll

13/01/2019  09:47    97?792 adhsvc.dll

10/04/2013  08:32   154?624 AdjustCalendarDate.exe

10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb

13/01/2019  10:47   534?016 AdmTmpl.dll

13/01/2019  09:48    58?368 adprovider.dll

13/01/2019  10:47   136?704 adrclient.dll

13/01/2019  09:48   248?832 adsldp.dll

13/01/2019  09:46   251?392 adsldpc.dll

13/01/2019  09:48   101?376 adsmsext.dll

13/01/2019  09:48   350?208 adsnt.dll

13/01/2019  09:46   849?920 adtschema.dll

13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on 
them), and many legacy applications that don’t have proper UTF-8 or Unicode 
support (I know, they should be updated).  It doesn’t seem to me that changing 
French thousands separator to NNBSP solves all of the perceived problems.


Keeping these applications outdated has no other benefit than providing a handy 
lobbying tool against support of NNBSP. What are all these expected to do while 
localized with scripts outside Windows code pages?

Also, when you need those apps, just tailor your French accordingly. That should 
not impact all the other users out there interested in a civilized layout, which we 
cannot get with NBSP, as it is justifying and numbers are torn apart in 
justified layout, nor with FIGURE SPACE as recommended in UAX #14, because it’s 
too wide and has no other benefit. BTW, FIGURE SPACE gives the same question mark 
in the Windows terminal, I guess, based on the above.
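The question-mark effect is easy to reproduce; a quick Python check (my own illustration) shows that neither NNBSP nor FIGURE SPACE survives a round trip through a legacy Windows code page, while NBSP does:

    s_nnbsp  = "15\u202f360"   # 15 NNBSP 360
    s_figure = "15\u2007360"   # 15 FIGURE SPACE 360
    s_nbsp   = "15\u00a0360"   # 15 NBSP 360

    print(s_nnbsp.encode("cp1252", errors="replace"))    # b'15?360'   -- no U+202F in cp1252
    print(s_figure.encode("cp1252", errors="replace"))   # b'15?360'   -- no U+2007 either
    print(s_nbsp.encode("cp1252"))                       # b'15\xa0360' -- NBSP maps to 0xA0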

As long as Segoe UI has NNBSP support, no worries; that’s what CLDR data is for. 
Any legacy program can always use downgraded data; you can even replace NBSP if 
the expected output is plain ASCII. Downgrading is straightforward, the reverse 
is not, which is why vetters are working so hard during CLDR surveys. CLDR 
data is kind of high-end; that is the only useful goal. Again, downgrading is 
easy: just run a tool on the data and the job is done. You’ll end up with two 
libraries instead of one, but at least you’re able to provide a good UX in 
environments supporting any UTF.

Best,

Marcel



Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:20, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


May I remind you that the beginnings of Unicode predate the development of the world wide 
web. By 1993 the web had developed to where it was possible to easily access material 
written in different scripts and languages, and by today it is certainly possible to 
"sample" material to check for character usage.

When Unicode was first developed, it was best to work from the definition of 
character sets and to assume that anything encoded in a give set was also used 
somewhere. Several corporations had assembled supersets of character sets that 
their products were supporting. The most extensive was a collection from IBM. 
(I'm blanking out on the name for this).

These collections, which often covered international standard character sets as 
well, were some of the prime inputs into the early drafts of Unicode. With the 
merger with ISO 10646 some characters from that effort, but not in the early 
Unicode drafts, were also added.

The code points from U+2000..U+2008 are part of that early collection.

Note, that prior to Unicode, no character set standard described in detail how 
characters were to be used (with exception, perhaps of control functions). 
Mostly, it was assumed that users knew what these characters were and the 
function of the character set was just to give a passive enumeration.

Unicode's character property model changed all that - but that meant that 
properties for all of the characters had to be determined long after they were 
first encoded in the original sources, and with only scant hints of the 
identity of what these were intended to be. (Often, the only hint was a 
character name and a rather poor bitmapped image).

If you want to know the "legacy" behavior for these characters, it is more useful, 
therefore, to see how they have been supported in existing software, and how they have been used in 
documents since then. That gives you a baseline for understanding whether any change or 
clarification of the properties of one of these code points will break "existing 
practice".

Breaking existing practice should be a dealbreaker, no matter how 
well-intentioned a change is. The only exception is where existing 
implementations are de-facto useless, because of glaring inconsistencies or 
other issues. In such exceptional cases, deprecating some interpretations of a 
character may be a net win.

However, if there's a consensus interpretation of a given character then you can't just go 
in and change it, even if it would make that character work "better" for a 
given circumstance: you simply don't know (unless you research widely) how people have 
used that character in documents that work for them. Breaking those documents 
retroactively, is not acceptable.


That is, however, what PRI #308 proposed to do: change the Gc of NNBSP from Zs 
to Pc (not to Cf, as I mistakenly quoted from memory, confusing it with the 
*MONGOLIAN SUFFIX CONNECTOR, which would be a format control). That would break, 
for example, implementations relying on Gc=Zs for the purpose of applying 
a background color to all (otherwise invisible) space characters.
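Here is the kind of check such an implementation might rely on, and that a Gc change away from Zs would silently break (a hedged Python illustration, not any specific product’s code):

    import unicodedata

    def highlight_spaces(text):
        # mark every character whose General_Category is Zs (space separator)
        return "".join("[SP]" if unicodedata.category(ch) == "Zs" else ch for ch in text)

    print(highlight_spaces("1\u202f234"))   # today U+202F is Zs, so: 1[SP]234
                                            # with Gc=Pc it would no longer be marked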

On the occasion of that Public Review Issue, J. S. Choi reported another use 
case of NNBSP: between an integer and a vulgar fraction, pointing out an error in 
TUS version 8.0 by the way: “the THIN SPACE does not prevent line breaking from 
occurring, which is required in style guides such as the Chicago Manual of 
Style”. ― In version 11.0 the erroneous part is still uncorrected: “If the 
fraction is to be separated from a previous number, then a space can be used, 
choosing the appropriate width (normal, thin, zero width, and so on). For 
example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.”  Note 
that TUS has typeset this with the precomposed U+00BE, not with plain digits 
and fraction slash.

If U+2008 PUNCTUATION SPACE is used as intended, changing its line break 
property from A to GL does not break any implementation or document. As for 
possible misuse of the character in ways other than intended, generally there 
is no point in using as br

Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 18/01/2019 19:02, Asmus Freytag via Unicode wrote:

On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote:

I understand only better why a significant majority of UTC hates French.

Francophobia is also palpable in Canada, beyond any technical reasons, 
especially in the IT industry. Hence the position of UTC is far from isolated. 
If ethical and personal considerations inflect decision-making, they should 
consistently be an integral part of discussions here. In that vein, I’d mention 
that by the time Unicode was developed, there was a global hatred against 
France, which originated in French colonial and foreign politics since WWII and 
was revived a few years earlier when the French government sank the 푅푎푖푛푏표푤 푊푎푟푟푖표푟 
and killed the crew’s photographer in the port of Auckland. That crime 
triggered a peak of anger.


Again, my recollections do *not support* any issues of _Francophobia_.

The Unicode Technical committee has always had French people on board, from the 
beginning, and I have witnessed no issues where they took up a different 
technical position based on language. Quite the opposite, the UTC generally 
appreciates when someone can provide native insights into the requirements for 
supporting a given language. How best to realize these requirements then 
becomes a joint effort.

If anything, the Unicode Consortium saw itself from the beginning in contrast 
to an IT culture for which internationalization at times was still something of 
an afterthought.

Given all that, I find your suggestions and implications deeply hurtful and 
hope you will find a way to avoid a repetition in the future.

May I suggest that trying to rake over the past and apportion blame is 
generally less productive than _moving forward _and addressing the outstanding 
problems.


It is my last-resort track, one that I’m deeply convinced of. But I’m thankfully 
relieved of the need to discuss it here further.

To point out a well-founded behavior is not to blame. You’ll note that I carefully 
laid out how UTC would have been right in doing so, if they did. I wasn’t aware that I was 
being hurtful. You tell me, so I apologize. Please note, though, based on my past 
e‑mail, that I see UTC as a compound of multiple, sometimes antagonistic 
tendencies. Just one example to help understand what I mean: when Karl Pentzlin 
proposed to encode a missing French abbreviation indicator, a typographer was 
directed to argue (on behalf of his employer, IIUC) that this would amount to 
encoding all scripts in bold and italic. The OP protested that it wouldn’t, but 
he went unheard. That example raises much concern, all the more as we were told on 
this List that decision makers in UTC refuse to join open and public 
discussions here, engaging only in “duelling ballot comments.”

Now since, regardless of whether they would have been right in doing so, they did not at all, I’m 
plunged again into disarray. May I quote Germaine Tillion, a French ethnologist: 
it’s important to understand what happens to us; to understand is to exist. ― 
Originally, “to exist” meant “to stand out.” That is still somewhat implied in 
the strong sense of “to exist.” Understanding also helps to overcome. 
That’s why I wrote one e‑mail earlier:

Nothing happens, or fails to happen, without a good reason.
Finding out that reason is key to recovery.
If we want to get what we need, we must do our homework first.

Thanks for helping bring it to the point.

Kind regards,

Marcel


Re: NNBSP

2019-01-18 Thread Marcel Schneider via Unicode

On 17/01/2019 20:11, 梁海 Liang Hai via Unicode wrote:

[Just a quick note to everyone that, I’ve just subscribed to this public list, 
and will look into this ongoing Mongolian-related discussion once I’ve mentally 
recovered from this week’s UTC stress. :)]


Welcome to Unicode Public.

Hopefully this discussion helps sort things out so that we’ll know both what to 
do wrt Mongolian and what to do wrt French.

On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:

 [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encoding in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. The early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

Thank you for this insight. It is a still untold part of the history of Unicode.

This historical summary does *not* square in key points with my own 
recollection (I was there). I would therefore not rely on it as if it were gospel truth.

In particular, one of the key technologies that _brought industry partners to 
cooperate around Unicode_ was font technology, in particular the development of 
the /TrueType/ Standard. I find it not credible that no typographers were part 
of that project :).


It is probably part of the (unintentional) false accusations spread by the cited 
author’s paper. My apologies for not sufficiently assessing the reliability of 
my sources. I had already identified a number of errors but wasn’t savvy enough 
to see the other one reported by Richard Wordingham. The paper now ends up 
as a mere libel. It doesn’t mention the lack of NNBSP; instead it piles up a 
bunch of gratuitous calumnies. Should that be the prevailing mood of average 
French professionals with respect to Unicode ― indeed Patrick Andries is the 
only French tech writer on Unicode I found whose work is acclaimed, the others 
being either disliked or silent (or libellers) ― then I understand only better 
why a significant majority of UTC hates French.

Francophobia is also palpable in Canada, beyond any technical reasons, 
especially in the IT industry. Hence the position of UTC is far from isolated. 
If ethical and personal considerations inflect decision-making, they should 
consistently be an integral part of discussions here. In that vein, I’d mention 
that by the time Unicode was developed, there was a global hatred against 
France, which originated in French colonial and foreign politics since WWII and 
was revived a few years earlier when the French government sank the 푅푎푖푛푏표푤 푊푎푟푟푖표푟 
and killed the crew’s photographer in the port of Auckland. That crime 
triggered a peak of anger.


Covering existing character sets (National, International and Industry) was _an_ (not 
"the") important goal at the time: such coverage was understood as a necessary 
(although not sufficient) condition that would enable data migration to Unicode as well 
as enable Unicode-based systems to process and display non-Unicode data (by conversion).


I’d take this as a touchstone to infer that there were actual data files 
including standard typographic spaces as encoded in U+2000..U+2006, and 
electronic table layout using these: “U+2007 figure space has a fixed width, 
known as tabular width, which is the same width as digits used in tables. 
U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?


The statement: "there was initially no desire to encode all the languages and 
scripts" is categorically false.


Though Unicode was designed as being limited to some 65 000 characters, and it was 
stated that historic scripts were out of scope, only livi

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encoding in the UCS. The problem is that the initial rush for French 
was made in a period where Unicode and ISO were competing and not in sync, so no 
agreement could be found, until there was a decision to merge the efforts. The early rush 
was in ISO still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no desire to 
encode all the languages and scripts, focusing initially only on trying to unify the 
existing vendor character sets which were already implemented by a limited set of 
proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of 
the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national 
standard or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)


Thank you for this insight. It is a still untold part of the history of Unicode.

It seems that there was little incentive to involve typographers because they 
have no computer science training, and because they were feared as trying to 
enforce requirements that Unicode was neither able nor willing to meet, such 
as distinct code points for italics, bold, small caps…

Among the grievances, Unicode is blamed for confusing Greek psili and dasia 
with comma shapes, and for misinterpreting Latin letter forms such as the u 
with descender taken for a turned h, and double u mistaken for a turned m, 
errors that subsequently misled font designers into applying misplaced serifs. 
Things were done in haste and in a hurry, under the sword of Damocles of a hostile 
ISO meddling and threatening to unleash an unusable standard if Unicode wasn’t 
quicker.


If publishers had been involved, they would have revealed that they all needed 
various whitespaces for correct typography (i.e. layout). Typographers themselves 
did not care about whitespaces because they had no value for them (no glyph to 
sell).


Nevertheless the whole range of traditional space forms was admitted, even though 
they were going to be of limited usability. And they were given properties.
Or can’t the misdefinition of PUNCTUATION SPACE be traced back to that era?


Adobe's publishing software were then completely proprietary (just like Microsoft and others like 
Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, 
dictionaries. It was even mandatory to enter these [FINE] in the composed text and they trained 
their typists or ads sellers to use it (that character was not "sold" in classified ads, 
it was necessary for correct layout, notably in narrow columns, not using it confused the readers 
(notably for the ":" colon): it had to be non-breaking, non-expanding by justification, 
narrower than digits and even narrower than standard non-justified whitespace, and was consistently 
used as a decimal grouping separator.


No doubt they were confident that when a UCS was set up, such an important 
character wouldn’t be skipped.
So confident that they never guessed that they had a key role to play in reviewing, in 
providing feedback, in lobbying.
Too bad that we’re still so few people today, corporate vetters included, 
even though many things are still going wrong.


But at that time the most common OSes did not support it natively because there 
was no vendor charset supporting it (and in fact most OSes were still unable to 
render proportional fonts everywhere and were frequently limited to 8-bit 
encodings (DOS, Windows, Unix(es), and even Linux at its early start).


Was there a lack of foresight?
It turns out that today, as those characters are needed, they aren’t ready. Not 
even the NNBSP.

Perhaps it’s the poetic ‘justice of time’ that since Unicode is on, the 
Vietnamese are the foremost, and the French the hindmost.
[I’m alluding to the early lobbying of Vietnam for a comprehensive set of 
precomposed letters, while French wasn’t even granted the benefit 
of the NNBSP – which according to PRI #308 [1] is today the only known use of 
NNBSP outside Mongolian – and of a handful of ordinal indicators (possibly along with 
the rest of the alphabet, except q).

[1] “The only other widely noted use for U+202F NNBSP is for representation of the 
thin non-breaking space (/espace fine 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 14:36, I wrote:

[…]
The only thing that searches have brought up


It was actually the best thing. Here’s an even more surprising hit:

   B. In the rules, allow these characters to bridge both 
alphabetic and numeric words, with:

 * Replace MidLetter by (MidLetter | MidNumLet)
 * Replace MidNum by (MidNum | MidNumLet)


   -

   4. In addition, the following are also sometimes used, or could 
be used, as numeric separators (we don't give much guidance as to the best 
choice in the standard):

   0020 SPACE
   00A0 NO-BREAK SPACE
   2007 FIGURE SPACE
   2008 PUNCTUATION SPACE
   2009 THIN SPACE
   202F NARROW NO-BREAK SPACE

   If we had good reason to believe that if one of these only 
really occurred between digits in a single number, then we could add it. I 
don't have enough information to feel like a proposal for that is warranted, 
but others may. Short of that, we should at least document in the notes that 
some implementations may want to tailor MidNum to add some of these.


I fail to understand what the heck is going on. Why didn’t Unicode wish to sort out 
which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence 
and consistency, hence exit…
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.
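For illustration, “tailoring MidNum to add some of these” boils down to letting a chosen space bridge digits when segmenting; here is a hedged Python sketch using a plain regular expression rather than a real UAX #29 implementation:

    import re

    # digits grouped by NARROW NO-BREAK SPACE (U+202F), with an optional French decimal comma
    NUMBER = re.compile(r"\d{1,3}(?:\u202f\d{3})*(?:,\d+)?")

    print(NUMBER.findall("Prix : 1\u202f234\u202f567,89 €"))
    # ['1\u202f234\u202f567,89'] -- the NNBSP-grouped number stays in one piece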

CLDR agreed to fix this for French in release 34. In the current Survey 35 
everything is questioned again, must be assessed, and may impact implementations, while 
all other locales using a space are still impacted by bad display using NO-BREAK 
SPACE.

I know we have another public Mail List for that, but I feel it’s important to 
submit this to a larger community for consideration and eventually, for 
feedback.

Thanks.

Regards,

Marcel

P.S. For completeness:

http://unicode.org/L2/L2007/07370-punct.html

And also wrt my previous post:

https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt









Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:


[quoted mail]

But the French "espace fine insécable" was requested long long before Mongolian 
was discussed for encoding in the UCS.


Then we should be able to read its encoding proposal in the UTC document 
registry, but Google Search seems unable to retrieve it, so there is a big risk 
that no such proposal exists, even though the registry goes back to 1990.

The only thing that my searches have brought up is that the part of UAX #14 
that I quoted in the parent thread was added by a Unicode Technical 
Director not mentioned in the author field, and that he did it at the request of 
two gentlemen cited by first name only. I’m fairly sure their full names are 
Martin J. Dürst and Patrick Andries, but I may be wrong.

I apologize for the comment I made in my e‑mail. Still, it would be good to 
learn why the French use of NNBSP is sort of taken with a grain of salt, while 
all involved parties knew that this NNBSP was (as it still is) the only 
Unicode character ever encoded that is able to represent the so-long-asked-for “espace 
fine insécable.”

There is also another question I have been asking for a while: why wasn’t the character U+2008 
PUNCTUATION SPACE given the line break property value "GL" like its 
sibling U+2007 FIGURE SPACE?

This addition to UAX #14 is dated as early as “2007-08-08”. Why was the Core 
Specification not updated in sync, but only 7 years later? And was Unicode 
aware that this whitespace is hated by the industry to such an extent that a 
major vendor denied it support in a major font at a major release of a major OS?

Or did they wait in vain for Martin and Patrick to come knocking at their door to 
beg for font support?


Regards,

Marcel


The problem is that the initial rush for French was made in a period where 
Unicode and ISO were competing and not in sync, so no agreement could be found, 
until there was a decision to merge the efforts. The early rush was in ISO 
still not using any character model but a glyph model, with little desire to 
support multiple whitespaces; on the Unicode side, there was initially no 
desire to encode all the languages and scripts, focusing initially only on 
trying to unify the existing vendor character sets which were already 
implemented by a limited set of proprietary vendor implementations (notably 
IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA 
including the existing ISO 8859-*, GBK, and some national standard or de facto 
standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well there was Adobe at this time but still 
using another unrelated technology). Font standards were still not existing and were 
competing in incompatible ways, all was a mess at that time, so publishers were still 
required to use proprietary software solutions, with very low interoperability (at that 
time the only "standard" was PostScript, not needing any character encoding at 
all, but only encoding glyphs!)

If publishers had been involved, they would have revealed that they all needed various 
whitespaces for correct typography (i.e. layout). Typographers themselves did not care about 
whitespaces because they had no value for them (no glyph to sell).

Adobe's publishing software were then completely proprietary (just like Microsoft and others like 
Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required 
us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, 
dictionaries. It was even mandatory to enter these [FINE] in the composed text and they trained 
their typists or ads sellers to use it (that character was not "sold" in classified ads, 
it was necessary for correct layout, notably in narrow columns, not using it confused the readers 
(notably for the ":" colon): it had to be non-breaking, non-expanding by justification, 
narrower than digits and even narrower than standard non-justified whitespace, 
and was consistently used as a decimal grouping separator.


But at that time the most common OSes did not support it natively because there was no 
vendor charset supporting it (and in fact most OSes were still unable to render 
proportional fonts everywhere and were frequently limited to 8-bit encodings (DOS, 
Windows, Unix(es), and even Linux at its early start). So an intermediate solution was 
needed. The US chose not to use the non-breakable thin space at all because in English it was 
not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for 
everything (but including its own national symbol for the "$", competing with 
other ISO 646 variants). There were tons of legacy applications developed over decades 
that did not support anything else, and interoperability in the US was available only with 
ASCII, everything else was unreliable.

If you remember the early years when the Internet started to develop outside US, you remember the 

Re: NNBSP

2019-01-17 Thread Marcel Schneider via Unicode

On 17/01/2019 09:58, Richard Wordingham wrote:


On Thu, 17 Jan 2019 04:51:57 +0100
Marcel Schneider via Unicode  wrote:


Also, at least one French typographer was extremely upset
about Unicode not gathering feedback from typographers.
That blame is partly wrong since at least one typographer
was and still is present in WG2, and even if not being a
Frenchman (but knowing French), as an Anglophone he might
have been aware of the most outstanding use case of NNBSP
with English (both British and American) quotation marks
when a nested quotation starts or ends a quotation, where
_‘ ”_ or _“ ’_ and _’ ”_ or _” ’_ are preferred over the
unspaced compounds (_‘”_ or _“’_ and _’”_ or _”’_), at
least with proportional fonts.


There's an alternative view that these rules should be captured by the
font and avoid the need for a spacing character.  There is an example
in the OpenType documentation of the GPOS table where punctuation
characters are moved rightwards for French.


Thanks, I didn’t know that this is already implemented. Sometimes one can
read in discussions that the issue is dismissed to the font level. That always
looked utopian to me, all the more as people are trained to type spaces when
bringing in former typewriting expertise, and I always believed that it’s
a way for helpless keyboard layout designers to hand the job over.

It turns out there is more to it. But the high-end solution notwithstanding,
the use of an extra space character is recommended practice:

https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_
recommends a thin space, whereas _The Gregg Reference Manual_ promotes a
full space between the quotation marks. _The Chicago Manual of Style_ says
no space is necessary but adds that a space or a thin space can be inserted
as ‘a typographical nicety.’ ” The author cites three other manuals in which
she did not find any passage on the topic.

We note that all three style guides seem completely unconcerned with
non-breakability. Not so the author of the blog post: “[…] If your software
moves the double quotation mark to the next line of type, use a nonbreaking
space between the two marks to keep them together.” Certainly she would
recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard
or if the software provided a handy shortcut by default.



This alternative conception hits the problem that mass market Microsoft
products don't select font behaviour by language, unlike LibreOffice
and Firefox.  (The downside is that automatic font selection may then
favour a font that declares support for the language, which gets silly
when most fonts only support that language and don't declare support.)


Another drawback is that most environments don’t provide OpenType support,
that the whole scheme depends on language tags that could easily get
lost, and that the issue, being particular to French, would quickly boil
down to dismissing support as not cost-effective, arguing that *if* some
individual locale has special requirements for punctuation layout, its
writers are welcome to pick an appropriate space from the UCS and key it
in as desired.

The same is also observed about Mongolian. Today, the preferred approach
for appending suffixes is to encode a Mongolian Suffix Connector to make
sure the renderer will use correct shaping, and to leave the space to the
writer’s discretion. That indeed looks much better than imposing a hard
space that revealed itself to be cumbersome in practice, and that is
reported to often get in the way of a usable text layout.

The problems related to NNBSP as encountered in Mongolian are completely
absent when NNBSP is used with French punctuation or as the regular
group separator in numbers. Hence I’m sure that everybody on this List
agrees in discouraging changes made to the character properties of NNBSP,
such as switching the line breaking class (as "GL" is non-tailorable), or
changing general category to Cf, which could be detrimental to French.

However we need to admit that NNBSP is basically not a Latin but a
Mongolian space, despite being readily attracted into Western typography.
A similar disturbance takes place in word processors, where, except in
Microsoft Word 2013, NBSP does not justify as intended and as it does
on the web. It’s being hacked and hijacked despite being a bad compromise,
for the purpose of French punctuation spacing. That tailoring is in turn
very detrimental to Polish users, among others, who need a justifying
no-break space for the purpose of prepending one-letter prepositions.

Fortunately a Polish user found and shared a workaround using the string
, the latter being still used in lieu of WORD JOINER as
long as Word keeps failing to support the latest TUS (an issue that raised concern
at Microsoft when it was reported, and will probably be fixed or has
already been fixed meanwhile).



Another spacing m

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Marcel Schneider via Unicode

Courier New lacked NNBSP on Windows 7. It includes it on
Windows 10. The tests I referred to were made 2 years ago. I
confess that I was so disappointed to see Courier New not supporting
NNBSP a decade after its encoding, while many relevant people in the
industry were surely aware of its role and importance for French
(at least those keeping a branch office in France), that I gave it
up. It turns out that foundries delay support until the usage
is backed by TUS, which happened in 2014, in time for Windows 10.
(I have no hints about Windows 8 and 8.1.)

Superscripts are a handy parallel showcasing a similar process.
As long as preformatted superscripts are outlawed by TUS for use
in the digital representation of abbreviation indicators, vendors
keep disturbing their glyphs with what one could start calling an
intentional metrics disorder (IMD). One can also rank the vendors
on the basis of the intensity of IMD in preformatted superscripts,
but this is not the appropriate thread, and anyhow this List is
not the place. A comment on CLDR ticket #11653 is better.

[…]

Due to the way NNBSP made its delayed way into Unicode, font
support was reported as late as almost exactly two years ago to
be extremely scarce, this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New not supporting NNBSP. […]

Marcel


Re: Encoding italic (was: A last missing link)

2019-01-16 Thread Marcel Schneider via Unicode

On 17/01/2019 07:36, David Starner via Unicode wrote:
[…]

On the other hand, most people won't enter anything into a tweet they can't 
enter from their keyboard, and if they had to, would resort to cut and paste. 
The only people Unicode italics could help without change are people who 
already can use mathematical italics. If you don't have buy-in from systems 
makers, people will continue to lack practical access to italics in plain text 
systems.


Yes, that is the point here, and that’s why I wasn’t proposing anything other 
than what we can input right from the current keyboard layout. For italic plain text 
we would need a second keyboard layout or some corresponding feature, and would 
switch back and forth between the two. It’s feasible, at least for a wide 
subset of Latin locales, but it’s an action similar to changing the type wheel 
or the ball head.

Now thankfully the word is out.

Best regards,

Marcel




Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-16 Thread Marcel Schneider via Unicode

On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:


On Tue, 15 Jan 2019 13:25:06 +0100
Philippe Verdy via Unicode  wrote:


If your fonts behave incorrectly on your system because it does not
map any glyph for NNBSP, don't blame the font or Unicode about this
problem, blame the renderer (or the application or OS using it; maybe
they are very outdated and were not aware of these features, they
are probably based on old versions of Unicode when NNBSP was still
not present even if it was requested since very long at least for
French and even English, before even Unicode, and long before
Mongolian was then encoded, only in Unicode and not in any known
supported legacy charset: Mongolian was specified by borrowing the
same NNBSP already designed for Latin, because the Mongolian space
had no known specific behavior: the encoded whitespaces in Unicode
are completely script-neutral, they are generic, and are even
BiDi-neutral, they are all usable with any script).


The concept of this codepoint started for Mongolian, but was generalised
before the character was approved.


Indeed it was proposed as MONGOLIAN SPACE at block start, which was
consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
more. When Unicode argued in favor of a unification with , this was
pointed out as impracticable, and the need of a specific Mongolian space for
the purpose of appending suffixes was underscored. Only in London in
September 1998 was it agreed that “The Mongolian Space is retained but
moved to the general punctuation block and renamed ‘Narrow No Break Space’ ”.

However, unlike for the Mongolian Combination Symbols sequencing a question
and exclamation mark both ways, a concrete rationale as of how useful the
 could be in other scripts doesn’t seem to be put on the table when
the move to General Punctuation was decided.



Now, I understand that all claims about character properties that cannot
be captured in the UCD should be dismissed as baseless, but if we
believed the text of TUS we would find that NNBSP has some interesting
properties with application only to Mongolian:


As a side-note: The relevant text of TUS doesn’t predate version 11 (2018).



1) It has a shaping effect on following character.
2) It has zero width at the start of a line.
3) When the line-breaking algorithm does not provide enough
line-breaking opportunities, it changes its line-breaking property
from GL to BB.


I don’t believe that these additions to TUS are in any way able to fix
the many issues with NNBSP in Mongolian causing so much headache and
ending up in a unanimous desire to replace NNBSP with a *new*
*MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters,
e.g. “ ᠲᠠᠶᠢᠭᠠᠨ ”

https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf



Or is property (3) appropriate for French?


No, it isn’t. It only introduces new flaws for a character that,
despite being encoded for Mongolian with specific handling intended,
was readily ripped off for use in French, as Philippe Verdy reported,
to such an extent that it is actually an encoding error in Mongolian
that brought the long-missing narrow non-breakable thin space into
the UCS, in the block where it really belongs, and where it would have
been encoded from the beginning had there been no desire to keep
it proprietary.

That is the hidden (almost occult) fact from which stances like “The
NNBSP can be used to represent the narrow space occurring around
punctuation characters in French typography, which is called an
‘espace fine insécable.’ ” (TUS) and “When NARROW NO-BREAK SPACE
occurs in French text, it should be interpreted as an ‘espace fine
insécable’.” (UAX #14) are stemming. The underlying meaning
as I understand it now is something like: “The non-breakable thin space is
usually a vendor-specific layout control in DTP applications; it’s
also available via a TeX command. However, if you are interested
in an interoperable representation, here’s a Unicode character you
can use instead.”
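
For illustration, a minimal Python sketch, assuming only the standard
library’s unicodedata module, of U+202F used as a plain character in
French text:

    import unicodedata

    NNBSP = "\u202F"                        # NARROW NO-BREAK SPACE
    print(unicodedata.name(NNBSP))          # NARROW NO-BREAK SPACE
    print(unicodedata.category(NNBSP))      # Zs (space separator)

    # French high punctuation spaced off with the narrow no-break space,
    # so the exclamation mark cannot be wrapped onto the next line.
    print("Quel idiot" + NNBSP + "!")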

Due to the way NNBSP made its delayed way into Unicode, font
support was still reported as extremely scarce almost exactly
two years ago, as this analysis of the first 47 fonts on
Windows 10 shows:

https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Surprisingly for me, Courier New has NNBSP. We must have been
using old copies. I’m really glad that this famous and widely
used typeface has been updated. Please disregard my previous
posting about Courier New not supporting NNBSP. I’ll need to use
a font manager to output a complete list wrt NNBSP support.

I’m utterly worried about the fate of the non-breaking thin
space in Unicode, and I wonder why the French and Canadian
French people present at the setup – either on the Unicode side or
on the JTC1/SC2/WG2 side – didn’t get this character encoded in
the initial rush. Did they really sell themselves and their
locales to DTP lobbyists? Or were they tricked?

Also, at least one 

Re: wws dot org

2019-01-16 Thread Marcel Schneider via Unicode

On 15/01/2019 19:22, Johannes Bergerhausen via Unicode wrote:

Dear list,

I am happy to report that www.worldswritingsystems.org is now online.

The web site is a joint venture by

— Institut Designlabor Gutenberg (IDG), Mainz, Germany,
— Atelier National de Recherche Typographique (ANRT), Nancy, France and
— Script Encoding Initiative (SEI), Berkeley, USA.

For every known script, we researched and designed a reference glyph.

You can sort these 292 scripts by Time, Region, Name, Unicode version and 
Status.
Exactly half of them (146) are already encoded in Unicode.

So to date, Unicode has only made half its way, and for every single script in 
the
Standard there is another script out there that remains still unsupported.

First things first. When I first replied in the first thread of this year I 
already
warned:
>>> Having said that, still unsupported minority languages are top priority.

I didn’t guess that I opened a Pandora box whose content would lead us
far away from the only useful goal deeply embedded in the concept of
Unicode: support all of the world’s writing systems.

Instead, we’re discussing how to enable social media users to tune
ephemeral messages ever further, to attract even more of the scarce attention
of overwhelmed co-users before being buried in the mass of a vanishing timeline.

I sought feedback about using Unicode to get back the underlining feature known
from the typewriter era. But like some other hints I provided, it went unpicked…

Sadly it’s uninteresting: no cherries to pick. Also, if Unicode had to wait until enough
characters are picked for adoption prior to encoding the missing scripts, I’m
afraid the job won’t ever be done…

The industry is welcome to help speed up the process.

Thanks to Johannes Bergerhausen for setting up and sharing this resource.

Best regards,

Marcel


Re: Encoding italic

2019-01-16 Thread Marcel Schneider via Unicode

On 16/01/2019 06:05, David Starner via Unicode wrote:
[…]

[…] There's no one here regards plain text with derision, disdain or contempt.

There is one sort of so-called plain text that looks unbearable to me. That is the
draft-style plain text full of ASCII fallbacks. Especially the texts where Latin
abbreviation indicators that are correctly superscript are sitting on the baseline.
Also those using ASCII space or Latin-1 no-break space to space off French
punctuation, and where those marks are then cut off by line breaks, or torn apart by
justification when such plain text is the backbone of rich text on the web (where
NBSP remains unhacked, unlike in word processors where it’s fixed-width, and
even then it’s ugly).


[…] Dismissing the people who use Unicode in ways that aren't plain text is 
unfair […].

Is this statement applying the restrictive house policy about what is “ordinary (plain) 
text” as it is found in TUS? I’m asking the question because even if this statement is a 
mark of support and empathy, I’m uncomfortable with the idea that there seems to be a 
subset of Unicode that despite being plain text by definition, cannot be used in every 
plain text string. Please feel free to post your definition of "plain text". I 
feel that it will add to the collection.

Best regards,

Marcel



Re: Encoding italic

2019-01-15 Thread Marcel Schneider via Unicode

On 16/01/2019 02:15, James Kass via Unicode wrote:


Enabling plain-text doesn't make rich-text poor.

People who regard plain-text with derision, disdain, or contempt have
every right to hold and share opinions about what plain-text is *for*
and in which direction it should be heading.  Such opinions should
receive all the consideration they deserve.


Perhaps there’s a need to sort out what plain text is thought to be
across different user communities. Sometimes “plain text” is just a
synonym for _draft style_, considering that a worker should not need
to follow any style guide, because (a) many normal keyboards don’t
enable users to do so, and (b) the process is too complicated using
mainstream extended keyboard layouts.

From this point of view, any demand to key in directly a text in a
locale’s accurate digital representation is likely to be considered
an unreachable challenge and thus, an offense.

But indeed, people are entitled not to lower their requirements
as to what text is supposed to look like. From that POV, draft style
is unbearable, and being bound to it is then the actual offense.

The first step would then be to beef up that draft style so that it
integrates all characters needed for a fully featured representation
of a locale’s language, from curly quotes to preformatted superscript.
Unicode makes it possible, in direct line with what was set up
in ISO/IEC 6937. The next step is to design appropriate input methods.
Today, we can even get back the u̲n̲d̲e̲r̲l̲i̲n̲e̲ that we were deprived of,
by adding an appropriate dead key or combining diacritic, but that’s
still experimental. It already works better, though, than the Unicode
Syriac abbreviation control, whose overline is *not* rendered in
Chrome on Linux. In the same way, Unicode could encode a Latin italic
control, or as Victor Gaultney proposes, a Latin italic start control
and a Latin italic end control, directing the rendering engine to
pick italics instead of drawing a line along the rest of the word.

However, the discussion about Fraktur typefaces in the parent thread
made clear that reasoning in terms of roman vs italic is not really
interoperable, because in Roman typefaces, italic is polysemic, as
it’s used both for foreign words and for stress, while in Fraktur,
stress is denoted by spacing out, and foreign words, by using roman.
That would require a start and end pair of both Latin foreign word
controls and Latin stress controls.

As we see it from here, that would be even less implemented than
the Syriac abbreviation format control. It might be considered
Unicode conformant, since it would be part of the interoperable
digital representation of Latin script using languages, and its
use could be extended to other scripts.

But that is *not* what I’m asking for. First, we aren’t writing
in Fraktur any more, at least not in France nor in any other
language using preformatted superscript abbreviation indicators.
And second, if we need a document for fully fledged publishing,
we can use LaTeX or InDesign.

What I’m asking for is simply that people are enabled to write
in their language in a decent manner and can use that text in
any environment without postprocessing *and* without looking
downright bad.

That might please even those who are looking at draft style
with disdain.


Best regards,

Marcel


Re: A last missing link for interoperable representation

2019-01-15 Thread Marcel Schneider via Unicode

On 15/01/2019 10:24, Philippe Verdy via Unicode wrote:


On Mon, 14 Jan 2019 at 20:25, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

On 14/01/2019 06:08, James Kass via Unicode wrote:
>
> Marcel Schneider wrote,
>
>> There is a crazy typeface out there, misleadingly called 'Courier
>> New', as if the foundry didn’t anticipate that at some point it
>> would be better called "Courier Obsolete". ...
>
> 𝐴𝑟𝑡 𝑛𝑜𝑢𝑣𝑒𝑎𝑢 seems a bit 𝑝𝑎𝑠𝑠é nowadays, as well.
>
> (Had to use mark-up for that “span” of a single letter in order to
> indicate the proper letter form.  But the plain-text display looks
> crazy with that HTML jive in it.)
>

I apologize for seeming to question the font name 𝑝𝑒𝑟 𝑠𝑒 while targeting only
the fact that this typeface is not updated to support the NNBSP. It just
looks like the grand name is now misused to make people believe that if
**this** great font is not supporting NNBSP, it has a good reason to do so,
and we should keep people from using that “exotic whitespace” otherwise than
as “intended,” i.e. for Mongolian. Since fortunately TUS started backing its use
in French (2014)


This is not just for Mongolian: French has wanted this space for a long time, and it has a
use even in English, for centuries, in fine typography.
So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was
forgotten in the early stages of computing with legacy 8-bit encodings, but it should have
been in Unicode since the beginning, as its existence is proven long before the computing
age (before ASCII, or even before Baudot and telegraphic systems). It has always been
used by typographers, it has centuries of tradition in publishing. And it has always been
recommended, and still is today, for French by all book/paper publishers.

Many thanks for bringing this to the point. So the case is even worse, as Unicode
deliberately skipped the non-breakable thin space while thinking of encoding the whole
range of other typographic spaces, even with duplicate encoding of en and em spaces, and
not forgetting those old-fashioned tabular spaces and dash: figure space and dash, and
punctuation space. In this particular context and with all that historic practice as
background, what else but malice (supposedly inspired by an unlawful and exuberant DTP
vendor) could drive people not to define the line-breaking property value of U+2008
PUNCTUATION SPACE as "GL", while they did define it so for U+2007 FIGURE SPACE?

Here is also the still outdated wording of UAX #14 wrt NNBSP, Mongolian and 
French:

   […] NARROW NO-BREAK SPACE is used in Mongolian. The
MONGOLIAN VOWEL SEPARATOR acts like a NARROW NO-BREAK SPACE in its line breaking
behavior. It additionally affects the shaping of certain vowel characters as
described in Section 13.5, Mongolian, of [Unicode
<http://www.unicode.org/reports/tr41/tr41-23.html#Unicode>].

   NARROW NO-BREAK SPACE is a narrow version of 
NO-BREAK SPACE, which has exactly the same line breaking behavior, but with a 
narrow display width. It is regularly used in Mongolian in certain grammatical 
contexts (before a particle), where it also influences the shaping of the 
glyphs for the particle. In Mongolian text, the NARROW NO-BREAK SPACE is 
typically displayed with one third the width of a normal space character.

   When NARROW NO-BREAK SPACE occurs in French text, it 
should be interpreted as an “espace fine insécable”.


“When […] it should be interpreted as […]” is a pure insult. NARROW NO-BREAK SPACE *is* 
exactly at least the French "espace fine insécable" *and* the Mongolian 
whatever-it-is-called-in-Mongolian *and* the group separator, aka triad separator, in 
*all* locales following the SI and ISO recommendation to group digits with spaces, not 
with any punctuation.
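
As a sketch of that use, assuming only standard Python string formatting
(the helper name is invented for the example):

    NNBSP = "\u202F"   # NARROW NO-BREAK SPACE

    def group_triads(n: int) -> str:
        # Hypothetical helper: insert NNBSP between digit triads,
        # following the SI/ISO recommendation mentioned above.
        return f"{n:,}".replace(",", NNBSP)

    print(group_triads(299792458))   # 299 792 458, with narrow non-breaking gaps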

As hopefully that misleading section will be edited, here’s the link to the 
quoted version:
https://www.unicode.org/reports/tr14/tr14-41.html#DescriptionOfProperties


Also, I’d like, or rather I need, to kindly ask the knowledgeable List Members to
correct the following statement *if* it is wrong:

   If the Unicode Standard had been set up in an unbiased way, U+2008
PUNCTUATION SPACE would have been given the line break property value "GL".

Perhaps the following would also be true:

   If the Unicode Standard had been set up in an unbiased way, 
there would be a NARROW NO-BREAK SPACE encoded in the range U+2000..U+200F.


Thanks in advance to Philippe Verdy and any other knowledgeable List Members for
staying or getting in touch and (keeping on) posting feedback.

I don’t edit the subject line, nor do I spin off a new thread, given that when I
launched this one I sincerely believed that the issues with NARROW NO-BREAK
SPACE and with preformatted supe

Re: A last missing link for interoperable representation

2019-01-15 Thread Marcel Schneider via Unicode

On 15/01/2019 03:02, Asmus Freytag via Unicode wrote:

On 1/14/2019 5:41 PM, Mark E. Shoulson via Unicode wrote:

On 1/14/19 5:08 AM, Tex via Unicode wrote:


This thread has gone on for a bit and I question if there is any more light 
that can be shed.

BTW, I admit to liking Asmus definition for functions that span text being a 
definition or criteria for rich text.



Me too.  There are probably some exceptions or weird corner-cases, but it seems 
to be a really good encapsulation of the distinction which I had never seen 
before.


** blush **

A./



I did like it too, and I was really amazed that the issue could be boiled down to such a
handy shibboleth. It wasn’t until I looked harder that I couldn’t help seeing it any more as a
mere rewording of current practice. That is, if we’re using markup (that typically acts on spans
and other elements), it’s rich text; if we’re using characters, it’s plain text. The reason why I
changed my mind is that the new shibboleth can be misused to relegate to the realm of rich text
some feature of a writing system, like using superscript as ordinal indicators (English "3ʳᵈ",
French "2ᵉ" [order] or "2ⁿᵈ" [rank], Italian "1ᵃ" or — in Latin-1 — "1ª", the latter being used
in German as a narrow form of "prima" that has special semantics there ["top quality" or
"great!"]), only on the basis that it is currently emulated using rich text by declaring that
"ᵉ" is—or “should” be—a span with superscript markup, so that we end up with "2e".

As I’ve (too) briefly pointed out in a previous reply, that is not what we should
end up with. Abbreviation indicators in Latin script are a case of a
single-character solution, albeit multiple characters may be involved in a single
instance. We can also have inner uppercase, aka camel case, that cannot be
handled by the titlecase attribute. We’re clearly in the realm of plain text,
and any other solution may be called an emulation, or a legacy workaround, but
not a Unicode conformant interoperable representation.

Also, please note the presence in Unicode, of U+070F SYRIAC ABBREVIATION MARK, 
a format control… Probably there are also some other format controls in other 
scripts, performing likely the same job. Remember when a similar solution was 
suggested for Latin script on this List…

Best regards,

Marcel


Re: A last missing link for interoperable representation

2019-01-14 Thread Marcel Schneider via Unicode

On 15/01/2019 01:17, Asmus Freytag via Unicode wrote:

On 1/14/2019 2:08 PM, Tex via Unicode wrote:


Asmus,

I agree 100%. Asking where is the harm was an actual question intended to 
surface problems. It wasn’t rhetoric for saying there is no harm.


The harm comes when this is imported into rich text environments (like this 
e-mail inbox). Here, the math abuse and the styled text run may look the same, 
but I cannot search for things based on what I see. I see an English or French 
word, type it in the search box and it won't be found. I call that 'stealth' 
text.

The answer is not necessarily in folding the two, because one of the reasons for having math alphabetics is so you can 
search for a variable "a" of a certain kind without getting hits on every "a" in the text.
Destroying that functionality in an attempt to "solve" the problems created by the alternate facsimile of 
styled text is also "harm" in some way.


That may end up in a feature request for webmails and e-mail clients, where the 
user should be given the ability to toggle between what I’d call a “Bing search 
mode” and a “Google search mode.” Google Search has extended equivalence 
classes that enable it to handle math alphabets like plain ASCII runs, i.e. we 
may type a search in ASCII and Google finds instances where the text is typeset 
“abusing” math alphabets. On the other hand, Bing Search does not have such 
extended equivalence classes, and brings up variables as they are styled when 
searching correspondingly.
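
As a rough illustration of such an equivalence class (a sketch, not a
description of either vendor’s actual code), NFKC compatibility normalization
already folds the mathematical alphabets back to ASCII:

    import unicodedata

    # Build "kakistocracy" with MATHEMATICAL ITALIC SMALL letters; the word
    # contains no "h", which sits outside the block at U+210E, so the plain
    # offset from U+1D44E is safe here.
    styled = "".join(chr(0x1D44E + ord(c) - ord("a")) for c in "kakistocracy")
    folded = unicodedata.normalize("NFKC", styled)
    print(styled)   # 𝑘𝑎𝑘𝑖𝑠𝑡𝑜𝑐𝑟𝑎𝑐𝑦
    print(folded)   # kakistocracy – an ASCII query can now match the styled run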

I won’t blame Google for doing “harm”, and I’d rather position myself on
Google’s side, as it seems to meet the expectations of a larger part of end-user
communities. I won’t blame Microsoft either; I’m just noting a dividing line
between the two vendors about handling math alphabets.

Best regards,

Marcel


Re: A last missing link for interoperable representation

2019-01-14 Thread Marcel Schneider via Unicode

On 14/01/2019 08:26, Julian Bradfield via Unicode wrote:

On 2019-01-13, Marcel Schneider via Unicode  wrote:

[…]

These statements make me fear that the font you are using might not support
the NARROW NO-BREAK SPACE U+202F > <. If you see a question mark between


It displays as a space. As one would expect - I use fixed width fonts
for plain text.


It’s mainly that I suspected you could be using Courier New in the terminal.
It’s the default for plain text in major browsers, and there are devices whose
copy of Courier New shows a .notdef box for U+202F. That’s at least what I
understood from the feedback, and a test in my browser looked likewise.




these pointy brackets, please let us know. Because then, you’re unable to
read interoperably usable French text, too, as you’ll see double punctuation
(e.g. "?!") where a single mark is intended, like here !


I see "like here !".


That’s fine, your font has support for NNBSP. Thanks for reporting.

The reason why I’m anxious to see that checked is that the impact on
implementations of NNBSP as the group separator is being assessed.


French text does not need narrow spacing any more than science does.
When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$;
in plain text, 50cm does just fine.


By “plain text” you probably mean *draft style*. I’m thinking that
because "$50\thinspace\mathrm{cm}$" is no less plain text than "50cm".

Indeed, in not understanding that sooner I was an idiot, naively
believing that all Unicode List Members are using Unicode terminology.
Turns out that that cannot be taken for granted any more than knowing
the preferences of French people as to French text display, while not
being a Frenchman:

1. Most French people prefer that big punctuation be spaced off from
   the word it pertains to.

2. Most French people strongly dislike punctuation cut off by a line
   break, but cannot fix it because:
   a) the ordinary keyboard layout has no non-breaking spaces;
   b) the NNBSP readily available on particular keyboard layouts
      is buggy in most e-mail composers, ending up as breakable.

3. A significant part of French people strongly dislike angle quotes
   that are spaced off too far, as happens when using NBSP.


Likewise, normal French people writing email write "Quel idiot!", or
sometimes "Quel idiot !".


Normal people using normal keyboard layouts are writing with the
readily available characters most of the time. This is why (to pick
another example) French people abbreviate “numéro” to "n°", while
on a British English or an American English keyboard layout we
can’t normally expect anything other than "no", or "#" for “Number.”

We’re not trying to keep people from writing fast and in draft style.
What every locale is expected to achieve in the Unicode era is to
enable normal users to get the accurate interoperable representation
of their language while typing fast, as opposed to coding in TeX,
which is like using InDesign with system spaces instead of Unicode.
System spaces are not interoperable, nor is LaTeX \thinspace if that
is non-breakable in LaTeX, which it obviously is, since it is used
to represent the thin space between a number and a measurement unit.

In Unicode, as we know it, U+2009 THIN SPACE is breakable, and the
worst thing here is that its duplicate encoding U+2008 PUNCTUATION
SPACE is breakable too, instead of being non-breakable like U+2007
FIGURE SPACE. That is why there was a need to add U+202F NARROW
NO-BREAK SPACE later. (More details in the cited CLDR ticket.)
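
To make the comparison concrete, a short Python sketch; the line-break classes
are copied from UAX #14 as cited in this thread, not computed (Python’s
unicodedata does not expose the Line_Break property):

    import unicodedata

    # (code point, UAX #14 line-break class)
    spaces = [(0x00A0, "GL"),   # NO-BREAK SPACE          – glue, non-breaking
              (0x2007, "GL"),   # FIGURE SPACE            – glue, non-breaking
              (0x2008, "BA"),   # PUNCTUATION SPACE       – break after allowed
              (0x2009, "BA"),   # THIN SPACE              – break after allowed
              (0x202F, "GL")]   # NARROW NO-BREAK SPACE   – glue, non-breaking
    for cp, lb in spaces:
        print(f"U+{cp:04X}  {lb}  {unicodedata.name(chr(cp))}")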



If you google that phrase on a few French websites, you'll see that
some (such as Larousse, whom one might expect to care about such
things) use no space before punctuation,


Thanks for catching that; the flaw shall be reported with a link to
your email.

You may also wish to look up this page:
https://communaute.lerobert.com/forum/LE-ROBERT-CORRECTEUR/LE-ROBERT-CORRECTEUR-CORRECTION-D-ORTHOGRAPHE-DICTIONNAIRES-ET-GUIDES/Espace-entre-le-meotet-le-point-d-interrogation/2918628/398261

reading: “Le logiciel Le Robert correcteur justement signale les
espaces fines insécables si elles ne sont pas présentes sur le texte
et propose la correction.” (“Le Robert spellchecker does report
the lack of narrow no-break spaces and proposes to fix it.”)


while others (such as some
random T-shirt company) use an ASCII space.

The Académie Française, which by definition knows more about French
orthography than you do, uses full ASCII spaces before ? and ! on its
front page. Also after opening guillemets, which looks even more
stupid from an Anglophone perspective.


(See point 3 above.) That is a very good point. Indeed this website is
reasonably expected to be an example and a template of correct
typesetting for a French website. There are several reasons why it actually
is not. The main reason is that it is not the work of the A.F. itself,
but of web designers, webmaste

Re: A last missing link for interoperable representation

2019-01-14 Thread Marcel Schneider via Unicode

On 14/01/2019 04:00, Martin J. Dürst via Unicode wrote:
[…]

[…] As Asmus has shown, one of the best ways to understand what
Unicode does with respect to text variants is that style works on
spans of characters (words,...), and is rich text, but things that
work on single characters are handled in plain text. Upper-case is
definitely for the most part a single-character phenomenon (the recent
Georgian MTAVRULI additions being the exception).


Obviously the single-character rule also applies to superscript when
used as ordinal indicator or more generally, as abbreviation indicator.

Thanks for the hint, it’s all about interoperability and in this case
too the point in using preformatted characters is a good one IIUC.

Sorry for getting a little off-topic. There’s also one reply on my
to-do list where I’ll do even more so; I can’t help it, given it’s our
digital representation that’s at stake, and due to past neglect on
either side there’s still a need to painfully lobby for each
character while so many other important issues are out there…

Best Regards,

Marcel


Re: A last missing link for interoperable representation

2019-01-13 Thread Marcel Schneider via Unicode

On 13/01/2019 17:52, Julian Bradfield via Unicode wrote:

On 2019-01-12, James Kass via Unicode  wrote:

This is a math formula:
a + b = b + a
... where the estimable "mathematician" used Latin letters from ASCII as
though they were math alphanumerics variables.


Yup, and it's immediately understandable by anyone reading on any
computer that understands ASCII.  That's why mathematicians write like
that in plain text.


As far as the information that has been circulating on this List goes,
mathematicians are both using TeX and liking the Unicode math alphabets.




This is an italicized word:
𝑘𝑎𝑘𝑖𝑠𝑡𝑜𝑐𝑟𝑎𝑐𝑦
... where the "geek" hacker used Latin italics letters from the math
alphanumeric range as though they were Latin italics letters.


It's a sequence of question marks unless you have an up to date
Unicode font set up (which, as it happens, I don't for the terminal in
which I read this mailing list). Since actual mathematicians don't use
the Unicode math alphabets, there's no strong incentive to get updated
fonts.


These statements make me fear that the font you are using might not support
the NARROW NO-BREAK SPACE U+202F > <. If you see a question mark between
these pointy brackets, please let us know. Because then, you’re unable to
read interoperably usable French text, too, as you’ll see double punctuation
(e.g. "?!") where a single mark is intended, like here !

There is a crazy typeface out there, misleadingly called 'Courier New', as if
the foundry didn’t anticipate that at some point it would be better called
"Courier Obsolete". Or they did, but… (Referring to CLDR ticket #11423.)

BTW if anybody knows a version of Courier New updated to a decent level of
Unicode support, please be so kind and share the link so I can spread the word.




Where's the harm?


You lose your audience for no reasons other than technogeekery.


Aiming at extending the subset of environments supporting correct typesetting
is no geekery but awareness of our cultural heritage, which we’re committed to
maintaining and developing, carrying it over into the digital world while adapting
technology to culture, not the other way around.


Best regards,

Marcel


Re: A last missing link for interoperable representation

2019-01-12 Thread Marcel Schneider via Unicode

On 12/01/2019 00:17, James Kass via Unicode wrote:
[…]

The fact that the math alphanumerics are incomplete may have been
part of what prompted Marcel Schneider to start this thread.


No, really not at all. I didn’t even dream of having italics in Unicode
working out of the box. That would exactly be the sort of demand that
would have completely discredited me advocating the use of preformatted
superscripts for the Unicode conformant and interoperable representation
of a handful of languages spoken by one third of mankind and using the
Latin script, while no other scripts are concerned with that orthographic
feature. (There is no clear borderline between orthography and typography here,
but with ordinal indicators in particular and abbreviation indicators in
general we’re clearly on the orthographic side. SC2/WG3 would agree,
since they deemed "ª" and "º" worth encoding in 8-bit charsets.)

It started when I found in the XKB keysymdef.h four dead keysyms added
for Karl Pentzlin’s German T3, among which dead_lowline, and remembered
that at some point in history, users were deprived of the means of typing
a combining underscore. I didn’t think of the extra letterspacing (called
“gesperrt”, i.e. spaced out, in German) that Mark E. Shoulson mentioned upthread,
(a) because it isn’t used for that purpose in the locale I’m working for,
and (b) because emulating it with interspersed NARROW NO-BREAK SPACEs
would make that text unsearchable.



If stringing encoded italic Latin letters into words is an abuse of
Unicode, then stringing punctuation characters to simulate a "smiley"
(☺) is an abuse of ASCII - because that's not what those punctuation
characters are *for*.  If my brain parses such italic strings into
recognizable words, then I guess my brain is non-compliant.


I think that, like Google Search having extensive equivalence classes
treating mathematical letters like plain ASCII, text-to-speech software
could use a little bit of AI to recognize strings of those letters as
ordinary words with emphasis, like James Kass suggested – all the more so as
we’re actually able to add combining diacritics for correct spelling
in some diacriticized alphabets (including a few with non-decomposable
diacritics), though with somewhat less-than-optimal diacritic placement
in many cases in the current state of the art – and also parse ASCII art
correspondingly, unlike what happened in another example shared on
Twitter downthread of the math letters tweet:

https://twitter.com/ourelectra/status/1083367552430989315

Thanks,

Marcel


Re: A last missing link for interoperable representation

2019-01-07 Thread Marcel Schneider via Unicode

On 08/01/2019 06:32, Asmus Freytag via Unicode wrote:

On 1/7/2019 7:46 PM, James Kass via Unicode wrote:

Making recommendations for the post processing of strings containing the 
combining low line strikes me as being outside the scope of Unicode, though.


Agreed.

Those kinds of things are effectively "mark down" languages, a name chosen to 
define them as lighter weight alternatives to formal, especially SGML derived mark-up 
languages.

Neither mark-up nor mark down languages are in scope.


My hinting about post processing was only a door left open for those tagging my
suggestion as a dirty hack. I was so anxious about angry feedback that I
inverted the order of the two possible usages despite my preference for keeping
the combining underline while using proper fonts, fully agreeing with James
Kass. I was pointing out that, unlike rich text, enhanced capabilities of plain text
do not hold the user captive. With rich text we need to stay in rich text,
whereas the goal of this thread is to point out ways of ensuring interoperability.

The pitch is that if some languages are still considered as “needing” rich text
where others are correctly represented in plain text (stress, abbreviations),
the Standard needs to be updated in a way that it actually fully supports all
languages.

Having said that, still unsupported minority languages are top priority.

Best regards,

Marcel


A last missing link for interoperable representation

2019-01-07 Thread Marcel Schneider via Unicode

Previous discussions have already brought up how Unicode is supporting
those languages that despite being old in Unicode still require special
attention for their peculiar way of spacing punctuation or indicating
abbreviations. Now I wonder whether s̲t̲r̲e̲s̲s̲ can likewise be noted
in plain text without non-traditional markup such as *…* or …'… when a
language does not accept extra acute accents for that purpose.

One character we can think of is the combining underline.
Like everything else—new letters, narrow no-break space, superscripts—
the quality of the rendering depends on the fonts used on the computer.

Strings containing U+0332 COMBINING LOW LINE to denote stress, as a
replacement of italic, may be postprocessed to apply formatting, or
used as-is if interoperability matters along with semantic accuracy.
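
A minimal sketch of that use, assuming only plain Python strings (the helper
name is invented for the example):

    LOW_LINE = "\u0332"   # COMBINING LOW LINE

    def stressed(word: str) -> str:
        # Hypothetical helper: follow each base letter with U+0332.
        return "".join(ch + LOW_LINE for ch in word)

    marked = stressed("stress")
    print(marked)                          # s̲t̲r̲e̲s̲s̲ (font permitting)
    print(marked.replace(LOW_LINE, ""))    # stress – the base word stays searchable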

Best wishes,

Marcel


Preformatted superscript in ordinary text, paleography and phonetics using Latin script (was: Re: A sign/abbreviation for "magister" - third question summary)

2018-11-07 Thread Marcel Schneider via Unicode

On 06/11/2018 12:04, Janusz S. Bień via Unicode wrote:


On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:

Hi!

On the over 100 years old postcard

https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6

you can see 2 occurences of a symbol which is explicitely explained (in
Polish) as meaning "Magister".



[...]


The third and the last question is: how to encode this symbol in
Unicode?



A constructive answer to my question was provided quickly by James Kass:

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:

Mr͇ / M=ͬ


I answered:

On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:

[...]


For me only the latter seems acceptable. Using COMBINING LATIN SMALL
LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
the base character. However in the lack of a better solution I can live
with it :-)

An alternative would be to use SMALL EQUALS SIGN, but looks like fonts
supporting it are rather rare.


and Philippe Verdy commented:

On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:

[...]



There's a third alternative, that uses the superscript letter r,
followed by the combining double underline, instead of the normal
letter r followed by the same combining double underline.


Some comments were made also by Michael Everson:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]


I would encode this as Mʳ if you wanted to make sure your data
contained the abbreviation mark. It would not make sense to encode it
as M=ͬ or anything else like that, because the “r” is not modifying a
dot or a squiggle or an equals sign.  The dot or squiggle or equals
sign has no meaning at all. And I would not encode it as Mr͇, firstly
because it would never render properly and you might as well encode it
as Mr. or M:r, and second because in the IPA at least that character
indicates an alveolar realization in disordered speech. (Of course it
could be used for anything.)


FYI, I decided to use the encoding proposed by Philippe Verdy (if I
understand him correctly):

Mʳ̳

i.e.

'LATIN CAPITAL LETTER M' (U+004D)
'MODIFIER LETTER SMALL R' (U+02B3)
'COMBINING DOUBLE LOW LINE' (U+0333)

for purely pragmatic reasons: it is rendered quite well in my
Emacs. According to the 'fc-search-codepoint" script, the sequence is
supported on my computer by almost 150 fonts, so I hope to find in due
time a way to render it correctly also in XeTeX. I'm also going to add
it to my private named sequences list
(https://bitbucket.org/jsbien/unicode4polish).
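
For reference, a small Python sketch spelling out that sequence (it only lists
the code points; it says nothing about font coverage):

    import unicodedata

    magister = "\u004D\u02B3\u0333"   # Mʳ̳
    for ch in magister:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+004D  LATIN CAPITAL LETTER M
    # U+02B3  MODIFIER LETTER SMALL R
    # U+0333  COMBINING DOUBLE LOW LINE
    print(unicodedata.normalize("NFC", magister) == magister)   # True: NFC-stable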

The same post contained a statement which I don't accept:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]


The squiggle in your sample, Janusz, does not indicate anything; it is
only a decoration, and the abbreviation is the same without it.


One of the reasons I disagree was described by me in the separate thread
"use vs mention":

https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html

There were also some other statements which I find unacceptable:

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]


The abbreviation in the postcard, rendered in plain text, is "Mr".


He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at
9:38 GMT (and earlier in a private mail).

I understand that both of them by "plain text" mean Unicode.


On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:


  You could use the various hacks you've discussed, with modifier
letters; but that is not "encoding", that is "abusing Unicode to do
markup". At least, that's the view I take!


and was supported by Asmus Freytag on Wed, Oct 31 2018 at  3:12
-0700.

The latter elaborated his view later and I answered:

On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:

On Fri, Nov 02 2018 at  5:09 -0700, Asmus Freytag via Unicode wrote:


[...]


All else is just applying visual hacks


I don't mind hacks if they are useful and serve the intended purpose,
even if they are visual :-)


[...]


at the possible cost of obscuring the contents.


It's for the users of the transcription to decide what is obscuring the
text and what, to the contrary, makes the transcription more readable
and useful.


Please note that it's me who makes the transcription, it's me who has a
vision of the future use and users, and in consequence it's me who makes
the decision which aspects of text to encode. Accusing me of "abusing
Unicode" will not stop me from doing it my way.

I hope that at least James Kass understands my attitude:

On Mon, Oct 29 2018 at  7:57 GMT, James Kass via Unicode wrote:

[...]


If I were entering plain text data from an old post card, I'd try to
keep the data as close to the source as possible. Because that would
be my purpose. Others might have different purposes.


Some ideas were also presented which I would call "futuristic":
introducing a new combining character and using variation sequences.
These ideas should 

Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-05 Thread Marcel Schneider via Unicode

On 05/11/2018 17:46, Doug Ewell via Unicode wrote:


Philippe Verdy wrote:
  

Note that I actually propose not just one rendering for the <COMBINING ABBREVIATION MARK>
but two possible variants (that would be equally valid without preference).

Actually you're not proposing them. You're talking about them (at
length) on the public mailing list. If you want to propose something,
you should consider writing a proposal.


The accepted meaning of "to propose" is not limited to the technical
sense in which it is used with respect to Unicode. Also, Philippe and I are both
influenced by our French locale, where "je propose" has pretty wide
semantics.

To conform with Unicode terminology, simply think "suggest", as in:
“Note that I actually suggest not just one rendering […].”

Thanks anyway for encouraging Philippe Verdy to submit the related
encoding proposal.

Best regards,

Marcel


Re: Encoding

2018-11-05 Thread Marcel Schneider via Unicode

On 04/11/2018 20:19, Philippe Verdy via Unicode wrote:
[…]

Even the mere fallback to render the <COMBINING ABBREVIATION MARK> as
a dotted circle (total absence of support) will not completely block
reading the abbreviation:

* you'll see "2e◌" (which is still better than only "2e", with
minimal impact) instead of

* "2◌" (which is worse! this is still what already happens when you
use the legacy encoded superscript letter, which is also semantically
ambiguous for text processing), or

* "2e." (which is acceptable for rendering but ambiguous semantically
for text processing)


I’m afraid the dotted circle instead of the .notdef box would be confusing.



So compare things fairly: the solution I propose is EVEN
MORE INTEROPERABLE than using the existing superscript letters (which is
also impossible for noting all abbreviations as it is limited to just
a few letters, and most of the time limited to only the few lowercase
IPA symbols). It puts an end to the pressure to encode superscript
letters.


Actually it encompasses all Latin lowercase base letters except q.

As for putting an end to that pressure, that is also possible by encoding
the missing ones once and for all. As already stated, until the opposite
is posted authoritatively to this List, Latin script is deemed the only
one making extensive use of superscript to denote abbreviations, due to
strong and long-lasting medieval practice acting as a template on a few
natural languages, namely those enumerated so far, among which Polish.



If you want to support other notations (e.g. in chemical or
mathematical notations, where both superscript and subscript must be
present and stack together, and where the allowed variation using a
dot or similar) you need another encoding, and the existing legacy
superscript letters are not suitable either.


I don’t lobby to support mathematics with more superscripts, but for
sure UnicodeMath would be able to use them when the set is complete.
What I did for chemical notations is to point out that chemistry seems
to be disfavored compared to mathematics, because instead of peculiar
subscripts it uses subscript Greek small letters. Three of them, as
has been reported on this List. They are being refused because they
are letters of a script. If they were fancy symbols, they would be
encoded, like alchemical symbols and mathematical symbols are.

Further, on 04/11/2018 20:51, Philippe Verdy via Unicode wrote:
[…]

Once again you need something else for these technical notations, but
NOT the proposed <COMBINING ABBREVIATION MARK>, and NOT EVEN the
existing "modifier letters", which were in
fact first introduced only for IPA […]
[…] these letters are NOT conveying any semantic of an abbreviation,
and this is also NOT the case for their usage as IPA symbols).


They do convey that semantic if used in a natural language giving
superscript the semantics of an abbreviation.

Unicode does not encode semantics, as TUS specifies.



There's NO interoperability at all when taking **abusively** the
existing "modifier letters" or other superscript characters for use in abbreviations […].


The interoperability I mean is between formats and environments.
Interoperable in that sense is what is in the plain text backbone.


Keep these "modifier letters" or  or  for use as plain letters or plain digits or plain
punctuation or plain symbols (including IPA) in natural languages.


That is what I’m suggesting to do: Superscript letters are plain
abbreviation indicators, notably ordinal indicators and indicators
in other abbreviations, used in natural languages.


Anything else is abusive and should be considered only as "legacy"
encoding, not recommended at all in natural languages.


Put "traditional" in the place of "legacy", and you will come close
to what is actually going on when coding palaeographic texts is
achieved using purposely encoded Latin superscripts. The same
applies to living languages, because it is interoperable and fits
therefore Unicode quality standards about digitally representing
the world’s languages.

Finally, on 04/11/2018 21:59, Philippe Verdy via Unicode wrote:


I can take another example about what I call "legacy encoding" (which
really means that such encoding is just an "approximation" from which
no semantic can be clearly infered, except by using a non-determinist
heuristic, which can frequently make "false guesses").

Consider the case of the legacy Hangul "half-width" jamos: […]

The same can be said about the heuristics that attempt to infer an
abbreviation semantic from existing superscript letters (either
encoded in Unicode, or encoded as plain letters modified by
superscripting style in CSS or HTML, or in word processors for
example): it fails to give the correct guess most of the time if
there's no user to confirm the actual intended meaning


I don’t agree: As opposed to baseline fallbacks, Unicode superscripts
allow the reader to parse the string as an abbreviation, and machines
can be programmed to act likewise.



Such confirmation is the job of spell correctors in word processors:
[…] the user may type "Mr." then the wavy line will appear under
these 3 

Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Marcel Schneider via Unicode

Sorry, I didn’t truncate the subject line, it was my mail client.

On 04/11/2018 17:45, Philippe Verdy wrote:


Note that I actually propose not just one rendering for the
<COMBINING ABBREVIATION MARK> but two possible variants (that would
be equally valid without preference). Use it after any base cluster
(including with diacritics if needed, like combining underlines).

- the first one can be to render the previous cluster as superscript
(very easy to implement synthetically by any text renderer)

- the second one can be to render it as an abbreviation dot (also
very easy to do)

Fonts can provide their own mapping (e.g. to offer alternate glyph
forms or kerning for the superscript, they can also reuse the letter
forms used for other existing and encoded superscript letters, or to
position the abbreviation dot with negative kerning, for example
after a T), in which case the renderer does not have to synthesize
the rendering for a combining sequence not mapped in the
font.

Allowing this variation from the start will:

- allow renderers to support it fast (so a rapid adoption for
encoding texts in human languages, instead of the few legacy
superscript letters).

- allow font designers to develop and provide reasonable mappings if
needed (to adjust the position or size of the superscript) in updated
fonts (no requirement for them to add new glyphs if it's just to map
the same glyphs used by existing superscript letters)

- also prohibit the abuse of this mark for every text that one would
want to write in superscript (these cases can still use the few
existing superscript letters/digits/signs that are already encoded),
so this is not suitable for example for marking mathematical
exponents (e.g. "x²": if it's encoded with the <COMBINING ABBREVIATION
MARK> it could validly be rendered as "x2."): exponents must use the
superscript (either the already encoded ones, or using external
styles like in HTML/CSS, or in LaTeX which uses the notation "x^2",
both as a style, but also some intended semantic of an exponent and
certainly not the intended semantic of an abbreviation)


Unicode always (or in principle) aims at polyvalence, making characters
reusable and repurposable, while the combining abbreviation mark does
not solve the problems around making chemicals better represented in
plain text, as seen in the parent thread, for example. I don’t advocate
this use case, as I’m only lobbying for natural languages’ support as
specified in the Standard,* but it shouldn’t be forgotten given there is
some point in not disfavoring chemistry compared to mathematics, which is
already widely favored over chemistry when looking at the symbol blocks,
while chemistry is denied three characters because they are subscript
forms of already encoded letters.

Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it
needs OpenType support to work, while direct encoding of preformatted
superscripts and use as abbreviation indicators for an interoperable
digital representation of natural languages does not.

Best regards,

Marcel
* As already repeatedly stated, I’m taking the one bit where TUS states
that all natural languages shall be given a semantically unambiguous (ie
not introducing new ambiguity) and interoperable digital representation.



Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Marcel Schneider via Unicode

On 03/11/2018 23:50, James Kass via Unicode wrote:


When the topic being discussed no longer matches the thread title,
somebody should start a new thread with an appropriate thread title.



Yes, that is what the OP also called for, but my last reply, though it
took me some time to write, was sent without checking for new mail,
so unfortunately it didn’t acknowledge this. So let’s start this new thread
to account for Philippe Verdy’s proposal to encode a new format control.

But all that I can add so far, prior to probably stepping out of this
discussion, is that the industry does not seem to be interested in this
initiative. Why do I think so? As already discussed on this List, even
the long-existing FRACTION SLASH U+2044 has not been implemented by
major vendors, except that HarfBuzz does implement it and makes its
specified behavior available in environments using HarfBuzz, and
some major vendors’ products are actually available with
HarfBuzz support.

As a result, the Polish abbreviation of Magister as found on the
postcard, and all other abbreviations using superscript that have
been put in parallel with it in the parent thread, cannot be reliably
encoded without using preformatted superscript, as far as the goal
is a plain text backbone benefiting from reliable rendering
support, rather than a semantics-centered coding that may be easier
to parse by special applications but lacks wider industrial support.

If, nevertheless, the <COMBINING ABBREVIATION MARK> is encoded and
gains traction, or rather the reverse: if it gains traction and gets
encoded (I don’t know which way around to put it, given U+2044 has
been encoded but still cannot really be called widely
implemented), I would surely add it to keyboard layouts if I am
still maintaining any in that era.

Best regards,

Marcel


Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
[quoted mail]


Using variation selectors is only appropriate for these existing 
(preencoded) superscript letters ª and º so that they display the 
appropriate (underlined or not underlined) glyph.


And it is for forcing the display of DIGIT ZERO with a short stroke:
0030 FE00; short diagonal stroke form; # DIGIT ZERO
https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt
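
As a plain string, that standardized variation sequence is just the two code
points below; whether the short-stroke glyph actually appears depends entirely
on the font:

    import unicodedata

    zero_short_stroke = "\u0030\uFE00"   # DIGIT ZERO + VARIATION SELECTOR-1
    for ch in zero_short_stroke:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0030  DIGIT ZERO
    # U+FE00  VARIATION SELECTOR-1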

From that it becomes unclear why that isn’t applied to 4, 7, z and Z
mentioned in this thread, to be displayed open or with a short bar.

It is not a solution for creating superscripts on any letters and
marking that they should be rendered as superscript (notably, the base
letter to transform into superscript may also have its own combining
diacritics, which must be encoded explicitly, and if you use the
variation selector, it should allow variation on the presence or
absence of the underline (which must then be encoded explicitly as a
combining character)).


I totally agree that abbreviation-indicating superscript should not be
encoded using variation selectors; as already stated, I don’t prefer it.


So finally what we get with variation selectors is: <base letter, variation selector,
combining diacritic> and <letter precombined with the diacritic, variation selector>,
which are NOT canonically equivalent.


That seems to me like a flaw in canonical equivalence. Variations must
be canonically equivalent, and the variation selector position should
be handled or parsed accordingly. Personally I’m unaware of this rule.


Using a combining character avoids this caveat: <base letter,
combining diacritic, combining abbreviation mark> and <letter
precombined with the diacritic, combining abbreviation mark>, which
ARE canonically equivalent. And this explicitly states the semantic
(something that is lost if we are forced to use presentational
superscripts in a higher level protocol like HTML/CSS for rich text
format, and one just extracts the plain text; using collation will
not help at all, except if collators are built with preprocessing
that will first infer the presence of a <combining abbreviation mark>
to insert after each combining sequence of the plain text enclosed in
an italic style).


That exactly outlines my concern with calls for relegating superscript
as an abbreviation indicator to higher level protocols like HTML/CSS.


There's little risk: if the <COMBINING ABBREVIATION MARK> is not
mapped in fonts (or not recognized by text renderers to create
synthetic superscripts from existing recognized clusters), it
will render as a visible .notdef (tofu). But normally text renderers
recognize the basic properties of characters in the UCD and can see
that the <COMBINING ABBREVIATION MARK> has a combining mark general
property (they also know that it has combining class 0, so
canonical equivalences are not broken) to render a better symbol
than the .notdef "tofu": it should rather render a dotted circle.
Even if this tofu or dotted circle is rendered, it still explicitly
marks the presence of the abbreviation mark, so there's less
confusion about what is preceding it (the combining sequence that was
supposed to be superscripted).


The problem with the <COMBINING ABBREVIATION MARK> you are proposing
is that it contradicts streamlined implementation as well as easy
input of current abbreviations like ordinal indicators in French and,
optionally, in English. Preformatted superscripts are already widely
implemented, and coding of "4ᵉ" only needs two characters, input
using only three fingers in two strokes (thumb on AltGr, press key
E04 then E12) with an appropriately programmed layout driver. I’m
afraid that the solution with the <COMBINING ABBREVIATION MARK> would be
much less straightforward.


The <COMBINING ABBREVIATION MARK> can also have its own <variation
selector> to select other styles when they are optional, such as
adding underlines to the superscripted letter, or rendering the
letter instead as underscript, or as a small baseline letter with a
dot after it: this is still an explicit abbreviation mark, and the
meaning of the plain text is still preserved: the variation selector
is only suitable to alter the rendering of a cluster when it has
effectively several variants and the default rendering is not
universal, notably across font styles initially designed for specific
markets with their own local preferences: the variation selector
still allows the same fonts to map all known variants distinctly,
independently of the initial arbitrary choice of the default glyph
used when the variation selector is missing.


I don’t think German users would welcome being directed to input a
‹combining abbreviation mark› plus a ‹variation selector› instead of
a period.


Even if fonts (or text renderers) may map the ‹combining abbreviation
mark› to variable glyphs, this is purely stylistic; the semantics of
the plain text are not lost because the ‹combining abbreviation mark›
is still there. There's no need of any rich text to encode it (the
rich-text styles are not explicitly encoding that a superscript is
actually an abbreviation mark, so they cannot also allow variation like
rendering an underscript, or a baseline small glyph with an added
dot). Typically a ‹combining abbreviation mark› used in an English
style would

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 31/10/2018 at 19:34, Asmus Freytag via Unicode wrote:


On 10/31/2018 10:32 AM, Janusz S. Bień via Unicode wrote:
>
> Let me remind what plain text is according to the Unicode glossary:
> 
> Computer-encoded text that consists only of a sequence of code

> points from a given standard, with no other formatting or structural
> information.
> 
> If you try to use this definition to decide what is and what is not a

> character, you get a vicious circle.
> 
> As mentioned already by others, there is no other generally accepted

> definition of plain text.


Being among those who argued that the “plain text” concept cannot—and
therefore mustn’t—be used per se to disallow the use of a more or less
restricted or extended set of characters in what is called “ordinary text”,
I’m ending up adding the following in case it might be of interest:



This definition becomes tautological only when you try to invoke it in making 
encoding decisions, that is, if you couple it with the statement that only 
"elements of plain text" are ever encoded.


I don’t think that Janusz S. Bień’s concern is about this definition
being “tautological”. AFAICS the Unicode definition of “plain text” is
quoted to back the assumption that it’s hard to use that concept to argue
against the use of a given Unicode character in a given context, or to
use it to kill a proposal for characters significant in natural languages.

The reasoning is that the call not to use character X in plain text, while X is
a legal Unicode character whose use is not discouraged for technical reasons,
is as if “ordinary people” (a scare-quoted derivative of “ordinary text”) were
told that X is not a Unicode character. That discourse is a “vicious circle” in
that there is no limit to it until Latin script is pulled down to plain ASCII.
As is already well known, diacritics are handled by the rendering system and don’t
need to be displayed as such in the plain text backbone. I don’t believe that
the same applies to other scripts, but these are often not considered when the
encoding of Latin preformatted letters is fought, given that superscripting seems
to be proper to Latin, and originated from long-standing medieval practice and
writing conventions.



For that purpose, you need a number of other definitions of "plain text". 
Including the definition that plain text is the "backbone" to which you apply 
formatting and layout information. I personally believe that there are more 
2D notations where it's quite obvious to me that what is "placed" is a text 
element. More like maps and music and less like a circuit diagram, where the 
elements are less text like (I deliberately include symbols in the definition 
of text, but not any random graphical line art).


All two-dimensional notations here (outside the parenthetical) use higher-level
protocols; maps and diagrams are often vector graphics. But Unicode strove to
encode all needed plain text elements, such as symbols for maritime and weather
maps. Even arrows of many possible shapes, including 3D-looking ones, have been
encoded. While freehand (rather than “any random”) graphical art is out of scope,
we have a lot of box drawing, used with appropriate fonts to draw e.g. layouts of
keyboards above the relevant source code in plain text files (examples in XKB).

As a sidenote: box drawing, while useful, is unduly neglected at the font level,
even in the Code Charts, where the advance width, usually half an em, is
inconsistent between different sorts of elements belonging to the same block.
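
For illustration only (a minimal sketch, not taken from any actual XKB file):
the Box Drawing block, U+2500..U+257F, is what such plain-text keyboard
sketches are built from.

# A single key drawn with Box Drawing characters (U+2500..U+257F).
TOP    = "\u250C" + "\u2500" * 5 + "\u2510"  # top edge
MID    = "\u2502" + "  E  " + "\u2502"       # key label row
BOTTOM = "\u2514" + "\u2500" * 5 + "\u2518"  # bottom edge
print("\n".join((TOP, MID, BOTTOM)))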



Another definition of plain text is that which contains the "readable content" 
of the text.


As already discussed on this List, many documents in PDF have hard-to-read
plain text backbones, even misleading Google Search, for the purpose of handling
special glyphs (and, in some eras, even special characters).

As we've discussed here, this definition has edge cases; some 
content is traditionally left to styling.


Many pre-Unicode traditions are found out there that stay in use, partly for
technical reasons (mainly for lack of updated keyboard layouts), partly for
consistency with accustomed ways of doing things. The traditionally-left-to-styling
argument is all the more unconvincing. Even a letter that got to become LATIN
SMALL LETTER O E (Unicode 1.0) was composed on typewriters using the half
backspace, and was supposed to be _left to styling_ when it was pulled out of
the draft ISO/IEC 8859-1 by the fault of a Frenchman (name undisclosed for
privacy). And we’ve been told on this List that the tradition of using styling
(a special font) to display the additional Latin letters used to write Bambara
survived.

Example: some of the small words in 
some Scandinavian languages are routinely italicized to disambiguate their 
reading.


Other languages use titlecase to achieve the same disambiguation. E.g. French
titlecases the noun "Une", which means the "cover" (front page), not the
indefinite article, and German did the same when "Ein(e)" is a numeral, but today,

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 01/11/2018 16:43, Asmus Freytag via Unicode wrote:
[quoted mail]

I don't think it's a joke to recognize that there is a continuum here and that
 there is no line that can be drawn which is based on straightforward 
principles.

[…]

In this case, there is no such framework that could help establish pragmatic
 boundaries dividing the truly useful from the merely fanciful.


I think the red line was always between the positive and the negative answer to
the question whether a given graphic is relevant for legibility/readability of
the plain text backbone. But humans can be trained to mentally disambiguate
a mass of confusables, so the line vanishes and the continuum remains intact.

On 02/11/2018 06:22, Asmus Freytag via Unicode wrote:

On 11/1/2018 7:59 PM, James Kass via Unicode wrote:


Alphabetic script users write things the way they are spelled and spell things
 the way they are written.  The abbreviation in question as written consists of
 three recognizable symbols.  An "M", a superscript "r", and an equal sign
 (= two lines).  It can be printed, handwritten, or in fraktur; it will still
 consist of those same three recognizable symbols.

We're supposed to be preserving the past, not editing it or revising it.


Alphabetic script users' handwriting does not match print in all features.
Traditional German handwriting used a line like a macron over the letter 'u'
to distinguish it from 'n'. Rendering this with a u-macron in print would be
the height of absurdity.

I feel similarly about the assertion that the "two lines" are something that
 needs to be encoded, but only an expert would know for sure.


Indeed it would be relevant to know whether it is mandatory in Polish, and I’m
not an expert. But looking at several scripts using abbreviation indicators as
superscript, i.e. Latin and Cyrillic (when using the Latin-script-written
abbreviation of "Numero", given Cyrillic for "N" is "Н", so it’s strictly
speaking one single script, and two scripts using it), then we can easily see
how single and double underlines are added or not depending on font design
and on customary writing and display. E.g. the Romance feminine and masculine
ordinal indicators have one or zero underlines, to such an extent that French
typography specifies that the masculine ordinal indicator, despite being a
superscript small o, is unfit to compose the French "numéro" abbreviation,
which must not have an underline. Hence DEGREE SIGN is less bad than U+00BA.
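
For reference, the look-alike characters in play here can be told apart only by
code point; a minimal Python check (character names as in the UCD):

import unicodedata
# DEGREE SIGN carries no underline by design; MASCULINE ORDINAL INDICATOR may be
# rendered with an underline depending on the font; MODIFIER LETTER SMALL O is
# the preformatted superscript letter.
for c in ("\u00B0", "\u00BA", "\u1D52"):
    print(f"U+{ord(c):04X}", unicodedata.name(c), repr(c))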

If applying the same to Polish, "Magister" is "Mʳ" and is straightforward
to input when using a new French keyboard layout or an enhanced variant of
any national Latin one having small superscripts on the Shift+Num level, or
via a ‹superscript› dead key, mapped e.g. on Shift + AltGr/Option + E or
any of the 26 letter keys as mnemonically convenient ("superscript"
translates to French "exposant"); or ‹Compose› ‹^› [e] (where the ASCII
circumflex or caret is repurposed for superscript compose sequences, while
‹circumflex accent› is active *after* LESS-THAN SIGN, consistently with the
*new* convention for ‹inverted breve› using LEFT PARENTHESIS rather than "g").

These details are posted in this thread on this List rather than CLDR-USERS
in order to make clear that typing superscript letters directly via the
keyboard is easy, and therefore to propose it is not to harrass the end-user.

On 02/11/2018 13:09, Asmus Freytag via Unicode wrote:
[quoted mail]
[…]

To transcribe the postcard would mean selecting the characters appropriate
 for the printed equivalent of the text.


As already suggested, selecting the variants can be done using variation
selectors, provided the Standard has defined the intended use case.



If the printed form had a standard way of superscripting letters with a
 decoration below when used for abbreviations,


As already pointed out, Latin script does not benefit from a consensus
to use underline for superscript. E.g. Italian, Portuguese and Spanish
do use underline for superscript, English and French do not.


then, and only then would we start discussing whether this decoration
needs to be encoded, or whether it is something a font can supply as part
of rendering the (sequence of) superscripted letters.


I think the problem is not completely outlined as long as the use of
variation sequences is not mentioned. There is no "all or nothing"
dilemma, given that Unicode has the means of providing a standard way of
representing calligraphic variations using variation selectors. E.g.
the letter ENG is preferred in big lowercase form when writing
Bambara, while other locales may like it in hooked uppercase.
The Bambara Arial font makes sure it is the right glyph,
and Arial in general follows the Bambara preference, but other fonts
do not, while some of them have the Bambara-fit glyph inside but
don’t display it unless urged by an OpenType-supporting renderer,
with appropriate settings turned on, e.g. on a locale identifier basis.


(Perhaps with the aid of 

Re: A sign/abbreviation for "magister"

2018-11-01 Thread Marcel Schneider via Unicode

On 01/11/2018 01:21, Asmus Freytag via Unicode wrote:

On 10/31/2018 3:37 PM, Marcel Schneider via Unicode wrote:

On 31/10/2018 19:42, Asmus Freytag via Unicode wrote:

[…]

It is a fallacy that all text output on a computer should match the convention
of "fine typography".

Much that is written on computers represents an (unedited) first draft. Giving
such texts the appearance of texts, which in the day of hot metal typography,
was reserved for texts that were fully edited and in many cases intended for
posterity is doing a disservice to the reader.


The disconnect is in many people believing the user should be disabled to 
write
[prevented from writing]


Thank you for correcting.


his or her language without disfiguring it by lack of decent keyboarding, and
that such input should be considered standard for user input. Making such text
usable for publishing needs extra work, that today many users cannot afford,
while the mass of publishing has increased exponentially over the past decades.
The result is garbage, following the rule of “garbage in, garbage out.”


No argument that there are some things that users cannot key in easily and that 
the common
fallbacks from the days of typewritten drafts are not really appropriate in 
many texts that
otherwise fall short of being "fine typography".


The goal I wanted to reach by discussing and invalidating the biased and misused
concept of “fine typography” is that this thread could get rid of it, but I have
definitely been unsuccessful. It’s hard for you to understand that relegating
abbreviation indicators to the realm of “fine typography” reminds me of what I
got to hear (undisclosed for privacy) when asking that the French standard
keyboard layouts (plural) support punctuation spacing with NARROW NO-BREAK SPACE,
and that is closely related to the issue about social media that you pointed out
below.

Don’t worry about users not being able to “key in easily” what is needed for 
the digital
representation of their language, as long as:

1. Unicode has encoded what is needed;

2. Unicode does not prohibit the use of the needed characters.

The rest is up to keyboard layout designers. Keying in anything else is not an
issue so far.




The real
disservice to the reader is not to enable the inputting user to write his or her
language correctly. A draft whose backbone is a string usable as-is for 
publishing
is not a disservice, but a service to the reader, paying the reader due respect.
Such a draft is also a service to the user, enabling him or her to streamline 
the
workflow. Such streamlining brings monetary and reputational benefit to the 
user.


I see a huge disconnect between "writing correctly" and "usable as-is for 
publishing". These
two things are not at all the same.

Publishing involves making many choices that simply aren't necessary for more "rough 
& ready"
types of texts. Not every twitter or e-mail message needs to be "usable as-is for 
publishing", but
should allow "correctly written" text as far as possible.


Not every message, especially not those whose readers expect a quick response.
The reverse is true with new messages (tweets, thread launchers, requests,
invitations).
As already discussed, there are several levels of correctness. We’re talking 
only about
the accurate digital representation of human languages, which includes correct 
punctuation.
E.g. in languages using letter apostrophe, hashtags made of a word including an 
apostrophe
are broken when ASCII or punctuation apostrophe (close quote) is used, as we’ve 
been told.

Supposedly, part of this discussion would be streamlined if one could experience
how easy it can be to type one’s language’s accurate digital representation. But
it’s better to be told what goes on, and what “strawmen” we’re being confused
with, since, again, informed discussion brings advancement.



When "desktop publishing" as it was called then, became available, too many 
people started to
obsess with form over content. You would get these beautifully laid out 
documents, the contents
of which barely warranted calling them a first draft.


Typing one’s language’s accurate digital representation is not being obsessed
with form over content, provided that appropriate keyboarding is available. E.g.
the punctuation apostrophe is on level 1, where the ASCII apostrophe is when
digits are locked on level 1, on the French keyboard I have in use; otherwise,
digits are on level 3, where superscript e is also found for ready input of most
of the ordinals (except 1ᵉʳ/1ʳᵉ, 2ⁿᵈ for ranges, and plurals with ˢ):
2ᵉ 3ᵉ 4ᵉ 5ᵉ 6ᵉ 7ᵉ 8ᵉ 9ᵉ 10ᵉ 11ᵉ 12ᵉ. Hopefully that demo makes clear what is
intended. Users not needing accurate representation in a given string are free
to type otherwise.

The goal of this discussion is that Unicode allow accurate representation, not 
impose it.
Actually Unicode is still imposing inaccurate representation to some 

Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
On 01/11/2018 at 00:41, Martin J. Dürst wrote:
> 
> On 2018/11/01 03:10, Marcel Schneider via Unicode wrote:
> > On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:
> 
> >> When one does question the Académie about the fact, this is their
> >> reply:
> >>
> >> Le fait de placer en exposant ces mentions est de convention
> >> typographique ; il convient donc de le faire. Les seules exceptions
> >> sont pour Mme et Mlle.
> > Translation:
> > “Superscripting these mentions is typographical convention;
> > consequently it is convenient to do so. The only exceptions are
> > for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", 
> > Ms].”
> >>
> >> which, if my understanding of "convient" is correct, carefully does
> >> [not] quite say that it is *wrong* not to superscript, but that one should
> >> superscript when one can because that is the convention in typography.
> 
> As for translation of "il convient", I think Julian is closer to the 
> intended meaning. The verb "convenir" has several meanings (see e.g. 
> https://www.collinsdictionary.com/dictionary/french-english/convenir), 
> but especially in this impersonal usage, the meaning "it is advisable, 
> it is right to, it is proper to" seems to be most appropriate in this 
> context.
> 
> It may not at all be convenient (=practical) to use the superscripts, 
> e.g. if they are not easily available on a keyboard.

Very good, thank you. I forgot about the meaning of “convenient”, and
didn’t think of “advisable” nor of “right to, proper to”.

The point about keyboarding is essential. As long as superscripts are
considered exotic or at least very special and need to be grabbed off
a character picker, there is no point in bothering users with inputting
them. But since that is going to change, it would be good for Unicode
to be ready to back the corresponding keyboard layouts so that they
won’t get challenged by the sort of considerations prevailing among
hardliners. In part, i.e. for fr(-FR) ordinal indicators, Unicode is ready.

Best regards,

Marcel
> 
> (French isn't my native language, and nor is English)
(Neither is mine, but I’ve been based in France for a long time.)



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
On 31/10/2018 19:42, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 11:10 AM, Marcel Schneider via Unicode wrote:
> > 
> > > which, if my understanding of "convient" is correct, carefully does
> > > [not] quite say that it is *wrong* not to superscript, but that one should
> > > superscript when one can because that is the convention in typography.
> >
> > Draft style may differ from mail style, and this, from typography, only 
> > due to the limitations imposed by input interfaces. These limitations are 
> > artificial and mainly the consequence of insufficient development of said 
> > interfaces. If the computer is anything good for, then that should also 
> > include the transition from typewriter fallbacks to the true digital 
> > representation of all natural languages. Latin not excluded.
> 
> It is a fallacy that all text output on a computer should match the 
> convention 
> of "fine typography".
> 
> Much that is written on computers represents an (unedited) first draft. 
> Giving 
> such texts the appearance of texts, which in the day of hot metal typography, 
> was reserved for texts that were fully edited and in many cases intended for 
> posterity is doing a disservice to the reader.
> 

The disconnect is in many people believing the user should be disabled to write 
his or her language without disfiguring it by lack of decent keyboarding, and 
that such input should be considered standard for user input. Making such text 
usable for publishing needs extra work, that today many users cannot afford, 
while the mass of publishing has increased exponentially over the past decades. 
The result is garbage, following the rule of “garbage in, garbage out.” The 
real 
disservice to the reader is not to enable the inputting user to write his or 
her 
language correctly. A draft whose backbone is a string usable as-is for 
publishing
is not a disservice, but a service to the reader, paying the reader due 
respect. 
Such a draft is also a service to the user, enabling him or her to streamline 
the 
workflow. Such streamlining brings monetary and reputational benefit to the 
user.

That disconnect seems to originate from the time when the computer became a tool
empowering the user to write in all of the world’s languages thanks to Unicode.
The concept of “fine typography” was then used to draw a borderline between what
the user is supposed to input, and what he or she needs to get for publication.
In the same move, that concept was extended in a way that made it include the
quality of the string, in addition to what _fine typography_ really is: fine
tuning of the page layout, such as vertical justification, slight variations in
the width of non-breakable spaces, and of course, discretionary ligatures.

Producing a plain text string usable for publishing was then put out of reach
of most common mortals, by using the lever of deficient keyboarding, but also
supposedly by an “encoding error” (scare quotes) in the line break property of
U+2008 PUNCTUATION SPACE, which should be non-breakable like its siblings
U+2007 FIGURE SPACE (still, as per UAX #14, recommended for use in numbers) and
U+2012 FIGURE DASH, so as to gain the narrow non-breaking space needed to space
the triads in numbers using space as a group separator, and to space big
punctuation in a Latin-script-using locale where JTC1/SC2/WG2 held some meetings
for the UCS: French.

For everybody who has beneath his or her hands a keyboard whose layout driver is
programmed in a fully usable way, the disconnect implodes. At encoding and input
levels (the only ones that are really on-topic in this thread) the sorcery called
fine typography then sums up to nothing more than having the keyboard insert
fully diacriticized letters, right punctuation, accurate space characters, and
superscript letters as ordinal indicators and abbreviation endings, depending
on the requirements.

Now was I talking about “all text output on a computer”? No, I wasn’t. 

The computer is able to accept input of publishing-ready strings, since we have
Unicode. Precluding the user from using the needed characters by setting up
caveats and prohibitions in the Unicode Standard seems to me nothing other than
an outdated mode of operation. U+202F NARROW NO-BREAK SPACE, encoded in 1999 for
Mongolian [1][2], has been readily picked up by the French graphic industry.
In 2014, TUS started mentioning its use in French [3]; in 2018, it put it on
top [4]. That seems to me a striking example of how things encoded for other
purposes are reused (or, following a certain usage, “abused”, “hacked”,
“hijacked”) in locales like French. If it weren’t an insult to minority
languages, that language could be called, too, “digitally disfavored” in a
certain sense.

> On the other hand, I'm a firm believer in applying certain styling 

Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
On 31/10/18 at 23:05, Asmus Freytag via Unicode wrote:
[…]
> > Sad that Arabic ² and ³ are still missing.
>
> How about all the other sets of native digits?

The missing ones are hopefully already on the roadmap.
Or do you refer to the missing ² and ³ in all other native digits?
Obviously they need to be encoded if there is a demand like 
for Arabic.

Thanks for the call.

Best regards,

Marcel



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:
> 
> On 2018-10-31, Marcel Schneider via Unicode  wrote:
> 
> > Preformatted Unicode superscript small letters are meeting the French 
> > superscript 
> > requirement, that is found in:
> > http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
> > (in French). This brief article focuses on the spelling of the indicators, 
> > without questioning the fact that they are superscript.
> 
> When one does question the Académie about the fact, this is their
> reply:
> 
> Le fait de placer en exposant ces mentions est de convention
> typographique ; il convient donc de le faire. Les seules exceptions
> sont pour Mme et Mlle.
Translation: 
“Superscripting these mentions is typographical convention; 
consequently it is convenient to do so. The only exceptions are 
for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].”
> 
> which, if my understanding of "convient" is correct, carefully does
> [not] quite say that it is *wrong* not to superscript, but that one should
> superscript when one can because that is the convention in typography.

Draft style may differ from mail style, and this, from typography, only 
due to the limitations imposed by input interfaces. These limitations are 
artificial and mainly the consequence of insufficient development of said 
interfaces. If the computer is anything good for, then that should also 
include the transition from typewriter fallbacks to the true digital 
representation of all natural languages. Latin not excluded.

> 
> My original question was:
> 
> Dans les imprimés ou dans le manuscrit on écrit "1er, 45e"
> etc. (J'utilise l'indication HTML pour les lettres supérieures.)
> 
> La question est: est-ce que les lettres supérieures sont
> *obligatoires*, ou sont-ils simplement une question de style? C'est à
> dire, si on écrit "1er, 45e" etc., est-ce une erreur, ou un style
> simple mais correct? 
Translation: 
“In print or handwriting one spells "1er, 45e", 
and so on. (I’m using HTML tags for the superscript letters.)

The question is: Are the superscript letters *mandatory*, 
or are they simply a matter of style? I.e. when writing "1er, 45e", 
is that a mistake, or a simple but correct style?”
> 
> I did not think that their Dictionary desk would understand the
> concept of plain text, so I didn't ask explicitly for their opinions
> on encoding :)

If you don’t think that they would understand character encoding 
and the concept of plain text as described in the Unicode Standard, 
you may wish to explain it to them in detail prior to asking for 
their opinion on the subject.

Thank you anyway for letting us know.

> 
> Which takes us back to when typography is plain text...

When the typographic rendering is congruent with the underlying
plain text, that means that there is no formatting; but that is quite
impossible, given that the minimal default settings include a font and
a font size. If the plain text is an interoperable representation of
a natural language, and that language uses superscript as an
abbreviation indicator, that superscript must be visible when the
text string is displayed as-is. Otherwise the string referred to as “plain
text” is at risk of not being a legible representation of the intended
content. If despite that risk it is, then you are lucky.

Best regards,

Marcel



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
On 31/10/2018 at 17:03, Khaled Hosny wrote:
>
> A while I was localizing some application to Arabic and the developer
> “helpfully” used m² for square meter, but that does not work for Arabic
> because there is no superscript ٢ in Unicode, so I had to contact the
> developer and ask for markup to be used for the superscript so that I
> can use it as well. That nicely shows one of the problems with encoding
> superscript symbols for arbitrary text styling in Unicode, you can’t
> stop before duplicating the whole character repertoire or else you will be
> discriminating against some writing system or uncommon usage.

It seems to me that Arabic is lacking two characters when using Eastern 
Arabic digits, not Western Arabic. Since Unicode allows the m² and m³ unit 
notations, these should be implemented in any script using the same 
notation. Not the whole UCS, just these two, like Arabic per cent. Or do 
you have use cases in Arabic where superscript is used as an 
abbreviation indicator?

I don’t share the view according to which superscript is arbitrary in Latin.
There is a medieval tradition of superscripting. If it exists in Arabic, then it 
would be limited to these two missing digits. Many symbols were 
encoded for Arabic, notably mirrored arrows, so adding these two is quite
straightforward.

Sad that Arabic ² and ³ are still missing.

Best regards,

Marcel



Re: A sign/abbreviation for "magister" (was: Re: second attempt)

2018-10-31 Thread Marcel Schneider via Unicode
On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
> > You could use the various hacks
> > you've discussed, with modifier letters; but that is not "encoding",
> > that is "abusing Unicode to do markup". At least, that's the view I
> > take!
>
> +1

There seems to be a widespread confusion about what plain text is, and what
Unicode is for. From a US-QWERTY point of view, a current mental representation
of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute.
Unicode grants to every language its plain text representation. If superscript
acts as an abbreviation indicator in a given language, it is part of the plain
text representation of that language.

So far, so good. The core problem is now to determine whether superscript is
mandatory, and baseline is a fallback, or superscript is optional and decorative,
and baseline is correct. That may be a matter of opinion, as has been suggested.
However, we now know of a list of languages where superscript is mandatory and
baseline is a fallback. Leaving English aside, these languages by themselves need
the UTC to grant them the use of preformatted superscript letters.

Still in the beginning, when early Unicode set up the Standard, superscript
was ruled out of plain text, except when there was sort of a strong lobbying, 
like when Vietnamese precomposed letters were added. Phoneticists have a strong 
lobby, so they got some ranges of preformatted letters. To make sure nobody 
dare use them in running text elsewhere, all *new* superscript letters got 
names 
on a MODIFIER LETTER basis, while subscript letters got straightforward names 
having SUBSCRIPT in them. Additionally, strong caveats were published in TUS.

And the trick worked, as most of the time, one is now referring to the 
superscript 
letters using the “modifier letter” label that Unicode have decked them out 
with.

That is why, today, any discussion is at risk of being subject to strong biases 
when its result should allow some languages to use their traditional 
abbreviation 
indicators, in an already encoded and implemented form. Fortunately the front
has begun to move, as the CLDR TC has granted ordinal indicators to the French
locale as of v34.

Ordinal indicators are one category of abbreviation indicators. Consistently, 
the
already-ISO/IEC-8859-1-and-now-Unicode ordinal indicators are used also in 
titles
like "Sª", "Nª Sª", as found in the navigation pane of:
http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea

I’m not quite sure whether some people would still argue that that string isn’t 
understood differently from "Na Sa".

> In general, I have a certain sympathy for the position that there is no 
> universal
> answer for the dividing line between plain and styled text; there are some 
> texts
> where the conventional division of plain test and styling means that the plain
> text alone will become somewhat ambiguous.

That is why phonetics need preformatted super- and subscripts, and so do 
languages
relying on superscript as an abbreviation indicator.

> We know that for mathematics, a different dividing line meant that it is 
> possible
> to create an (almost) plain text version of many (if not most) mathematical
> texts; the conventions of that field are widely shared -- supporting a case 
> for
> allowing a standard encoding to support it.

Referring to Murray Sargent’s UnicodeMath, a Nearly Plain Text Encoding of 
Mathematics, 
https://www.unicode.org/notes/tn28/
is always a good point in this discussion. UnicodeMath uses the full range of 
superscript digits, because the range is full. It does not use superscript 
letters, 
because their range is not full. Hence if superscript digits had stopped at the 
legacy range "¹²³", only measurement units like the metric equivalents of sq ft
and cu ft could be written with superscripts, and that is already allowed
according to TUS. I don’t know why superscript 1 was added to ISO/IEC 8859-1,
though. Anyway,
since phonetics need a full range of superscript and subscript digits, these 
were 
added to Unicode, and therefore are used in UnicodeMath.

Likewise, phonetics need a nearly-full range of superscript letters, so these 
were 
added to Unicode, and therefore are used in the digital representation of 
natural 
languages.

> However, it stops short of 100% support for edge cases, as does the ordinary
> plain text when used for "normal" texts. I think, on balance, that is OK.

That is not clear as long as “ordinary plain text” is not defined for the 
purpose 
of this discussion. Since I have superscript small letters on live keys, and 
the 
superscript "ᵉ" even doubled on the same level as the digits (that it is used 
to 
transform into ordinals for most of them), my French keyboard layout driver 
allows 
the OS to output ordinary plain text consisting 

Re: A sign/abbreviation for "magister"

2018-10-31 Thread Marcel Schneider via Unicode
Thank you for your feedback.
 
On 30/10/2018 at 22:52, Khaled Hosny wrote:
 
> > First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671.
> > But it is a vowel sign. Many letters put above are called superscript 
> > when explaining in English.
> 
> As you say, this is a vowel sign not a superscript letter, so the name
> is a misnomer at best. It should have been called COMBINING ARABIC
> LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic
> it is called small or dagger alef.
 
Thank you for this information. Indeed the current French translation 
named it:
0670 DIACRITIQUE VOYELLE ARABE ALIF EN CHEF
* l'appellation anglaise de ce caractère est erronée
http://hapax.qc.ca/ListeNoms-10.0.0.txt
Translation:
0670 COMBINING ARABIC VOWEL ALEF ABOVE
* the English designation of this character is mistaken
 
Sorry for mistyping its code point, and for forgetting these facts.
What’s surprising, then, may be the ease with which it was named using SUPERSCRIPT, 
while superscripts seemed to be disliked in the Standard.
 
I note, now, that it should be called COMBINING ARABIC LETTER ALEF ABOVE,
as you indicate. (Translating to French as DIACRITIQUE LETTRE ARABE ALIF EN 
CHEF).
 
> 
> > There is the range U+FC5E..U+FC63 (presentation forms).
> 
> That is a backward compatibility block no one is supposed to use, there
> are many such backward compatibility presentation forms even of Latin
> script (U+FB00..U+FB4F).
> 
> So I don’t see what makes you think, based on this, that Unicode is
> favouring Arabic or other scripts over Latin.
 
Indeed it doesn’t. Sorry about my assumption, but I mainly cited Arabic 
first because its name starts with an A, and I remembered it uses a 
“SUPERSCRIPT” in running text.
 
Other scripts have:
10FC MODIFIER LETTER GEORGIAN NAR
# <super> 10DC
2D6F TIFINAGH MODIFIER LETTER LABIALIZATION MARK
# <super> 2D61
A69C MODIFIER LETTER CYRILLIC HARD SIGN
# <super> 044A
A69D MODIFIER LETTER CYRILLIC SOFT SIGN
# <super> 044C
[but the latter two are for dialectology]
These are in the Duployan block:
1BCA2 SHORTHAND FORMAT DOWN STEP
1BCA3 SHORTHAND FORMAT UP STEP
because vertical alignment is significant in stenography.
So it is in Latin script when superscript is used as an
abbreviation indicator.
However I see that the subjoiners and subjoined letters
obey another scheme than the one that led to super- or
subscripts.
 
On 31/07/2018 at 08:27, Martin J. Dürst wrote:
>
> > Making a safe distinction is beyond my knowledge, safest is not to 
> > discriminate.
>
> Yes. The easiest way to not discriminate is to not use titles in mailing 
> list discussions. That's what everybody else does, and what I highly 
> recommend.
 
OK. That is sound practice, which I observed for a long time, until I felt it
best to use Dr. 
Thanks for clearing it up.
 
On 30/10/2018 at 21:34, Julian Bradfield via Unicode wrote:
 
> The practice of using superscripts to end abbreviations is alive and
> well in manuscript - I do it myself in writing notes for myself. For
> example, "condition" I will often write as "condn", and
> "equation" as "eqn".
 
That tends to prove that legibility is suboptimal without superscripts, 
even in note/draft style, and consequently, in machine processed plain text 
“only more so” (quoting an expression from Ken Whistler’s reply to 
James Kass on 30/10/2018 05:54).
 
> > in the 17ᵗʰ or 18ᵗʰ century to keep it only for ordinals. Should Unicode 
> 
> What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
> is not now (when superscripting looks very old-fashioned) and never
> has been any requirement to superscript them, as far as I know -
> though since the OED doesn't have an entry for "1st", I can't easily
> check.
 
Then French, Italian, Portuguese and Spanish seem to be the only locales having 
superscript ordinal indicator requirements, or preferences if you prefer. 
 
The following forum has a comprehensive explanation for English, and for 
Romance 
languages except French:
https://english.stackexchange.com/questions/111265/should-ordinal-indicators-be-inline
Especially it explains where the American English lining ordinal indicators 
came from.
 
English Wikipedia’s Ordinal indicator article 
https://en.wikipedia.org/wiki/Ordinal_indicator
states that ordinal indicators and superscript letters don’t share the same 
glyph, 
which would explain why there was an intention to draft a proposal for encoding
French
ordinal indicators. (But I advised that that would be a waste of time, as 
Unicode’s 
preformatted superscripts are working out of the box.) 
 
Preformatted Unicode superscript small letters are meeting the French 
superscript 
requirement, that is found in:
http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
(in French). This brief article focuses on the spelling of the indicators, 
without questioning the fact that they are superscript.
 
On 31/08/2018 at 06:54, Janusz S. Bień via Unicode wrote:
[…]
> BTW, I find it strange that nobody refers to an old thread
> 
> 

Re: A sign/abbreviation for "magister"

2018-10-30 Thread Marcel Schneider via Unicode
On 30/10/2018  at 21:34, Khaled Hosny via Unicode wrote:
> 
> On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote:
> > E.g. in Arabic script, superscript is considered worth 
> > encoding and using without any caveat, whereas when Latin script is on, 
> > superscripts are thrown into the same cauldron as underscoring.
> 
> Curious, what Arabic superscripts are encoded in Unicode?
 
First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671.
But it is a vowel sign. Many letters put above are called superscript 
when explaining in English.
 
There is the range U+FC5E..U+FC63 (presentation forms).
 
Best regards,
 
Marcel

Re: A sign/abbreviation for "magister"

2018-10-30 Thread Marcel Schneider via Unicode
Rather than a dozen individual e-mails, I’m sending this omnibus reply
for the record, because even if here and in CLDR (SurveyTool forum and
Trac) everything has already been discussed and fixed, there is still
a need to keep acknowledging, so as not to fail to follow up, with
respect to the upcoming surveys, the next of which is to start in 30 days.

First here: On 29/10/2018 at 12:43, Dr Freytag via Unicode wrote:

[…]
> The use of superscript is tricky, because it can be optional in some
> contexts; if I write "3rd" in English, it will definitely be
> understood no different from "3rd". 

[Note that this second instance was actually intended to read "3ʳᵈ", 
but it was formatted using a higher-level protocol.]

[…]
> In TeX the two transition fluidly. If I was going to transcribe such
> texts in TeX, I would construct a macro […]
[…]
> Nevertheless, I think the use of devices like combining underlines
> and superscript letters in plain text are best avoided.

While most other scripts from Arabic to Duployan are generously granted
everything they need for accurate representation, starting with
preformatted superscripts and ending with superscripting or subscripting
format controls, Latin script is often quite deliberately pulled down
in order to make it unusable outside high-end DTP software, from
TeX to Adobe InDesign, with the notable exception of sparsely and
parsimoniously encoded preformatted characters for phoneticists and
medievalists. E.g. in Arabic script, superscript is considered worth
encoding and using without any caveat, whereas where Latin script is
concerned, superscripts are thrown into the same cauldron as underscoring.

Obviously Unicode doesn’t apply to Latin script the same principle it
does to all other scripts, i.e. to grant preformatted letters as suitable
if they are part of a standard representation and in some cases are
needed to ensure unambiguity. Mediterranean locales had preformatted
ordinal indicators even in the Latin-1-only era, even though "1a" and "2o"
may be understood no differently from "1ª" and "2º". The degree sign, which
is on French keyboards, is systematically hijacked to represent the
"n°" abbreviation, unless a string is limited to ASCII-only. Several
Latin-script-using locales have standard representations and strong
user demands for superscripts, which, instead of being satisfied at the
Unicode level as would be done for any other of the world’s scripts,
are obstinately rebuffed when not intended for phonetics, or in
some cases, for palaeography.

I wasn’t digging down to find out about those UTC members who on a 
regular basis are aggressively contradicting ballot comments about 
encoding palaeographic Latin letters, while proving unable to sustain 
any open and honest discussion on this List or elsewhere. Referring to 
what Dr Everson via Unicode wrote on 28/10/2018 at 21:49:

> I like palaeographic renderings of text very much indeed, and in fact
> remain in conflict with members of the UTC (who still, alas, do NOT
> communicate directly about such matters, but only in duelling ballot
> comments) about some actually salient representations required for
> medievalist use.


That said: On 29/10/2018 at 09:09, James Kass via Unicode wrote:
[…]
> If I were entering plain text data from an old post card, I'd try
> to keep the data as close to the source as possible. Because that
> would be my purpose. Others might have different purposes. 
> As you state, it depends on the intention. But, if there were an
> existing plain text convention I'd be inclined to use it. 
> Conventions allow for the possibility of interchange, direct
> encoding would ensure it.

The goal of discouraging Latin superscripts is obviously to ensure 
that reliable document interchange is limited to PDF. 

If Unicode were allowed to emit an official recommendation to use 
preformatted superscripts in Latin script, too, then font designers 
would implement comprehensive support of combining diacritics, and 
any plain text including superscripted abbreviations could use the 
preformatted characters, in order to gather the interoperability 
that Unicode was designed for. Referring to what Dr Verdy via Unicode 
wrote on 28/10/2018 at 19:01:

[…]
> However it is still not very elegant if we stil need to use only
> the limited set of superscript letters (this still reduces the
> number of abbreviations, such as those commonly used in French
> that needs a superscript "é")

The use of combining diacritics with preformatted superscripts is 
also the reason why Unicode is limiting encoding support to base 
letters, even for preformatted superscript letters. The rule that 
no *new* precomposed letters with acute accent are encoded anymore 
applies to superscripts too. A Unicode-conformant way to represent 
such abbreviations would IMO use U+1D49 followed by U+0301: ‹ᵉ́›.
Other representations may require OpenType support, which in Latin 
script is often turned off, supposedly in order to 
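
Coming back to the ‹ᵉ́› sequence suggested above, a minimal check (Python,
standard unicodedata only) confirms that it survives canonical normalization
unchanged, since no precomposed counterpart of a modifier letter with acute
exists:

import unicodedata

abbrev_e_acute = "\u1D49\u0301"  # MODIFIER LETTER SMALL E + COMBINING ACUTE ACCENT
# No canonical composition or decomposition applies, so NFC and NFD leave it as-is.
assert unicodedata.normalize("NFC", abbrev_e_acute) == abbrev_e_acute
assert unicodedata.normalize("NFD", abbrev_e_acute) == abbrev_e_acute
print([f"U+{ord(c):04X}" for c in abbrev_e_acute])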

Re: A sign/abbreviation for "magister"

2018-10-29 Thread Marcel Schneider via Unicode
On 29/10/18 20:29, Doug Ewell via Unicode wrote:
[…]
> ObMagister: I agree that trying to reflect every decorative nuance of
> handwriting is not what plain text is all about.

Agreed.

> (I also disagree with
> those who insist that superscripted abbreviations are required for
> correct spelling in certain languages, and I expect to draw swift
> flamage for that stance.)

It all (no “flamage”, just trying to understand) depends on how we 
set the level of requirements, and what is understood by “correct”.
There is even an official position arguing that representing an "œ" 
with an "oe" string is correct, and that using the correct "œ" is 
not required. 

> The abbreviation in the postcard, rendered in
> plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion

In English, “Mr” for “Mister” is correct, because English does not use 
superscript here, according to my knowledge. Ordinal indicators are 
considered different, and require superscript in correct representation.
Thus, being trained on English, one cannot easily evaluate what is
correct and what is required for correctness in a neighboring locale.

> just
> fuels the recurring demands for every Latin letter (and eventually those
> in other scripts) to be duplicated in subscript and superscript, à la
> L2/18-206.

That is a generic request, unrelated to any locale, based only on a kind 
of criticism of poor rendering systems. The “fake super-/subscripts” are 
already fixed if only OpenType is supported and fonts are complete.

> 
> Back into my hole now.

No worries. Stay tuned :-) Informed discussion brings advancement.

Best regards,

Marcel



Group separator migration from U+00A0 to U+202F

2018-09-17 Thread Marcel Schneider via Unicode
For people monitoring this list but not CLDR-users:

To be cost-effective, the migration from the wrong U+00A0 to the correct U+202F 
as group separator 
should be synched across all locales using space instead of comma or period. SI 
is international and 
specifies narrow fixed-width no-break space as mandatory in the role of a 
numbers group separator.

That is the place to remember that Unicode would have had such a narrow
fixed-width no-break space from its very beginning on, if U+2008 PUNCTUATION
SPACE had been treated the same as its relative, U+2007 FIGURE SPACE, both being
designed for legacy-style hard-typeset tabular number representation.
We can only ask why it was not, without any hope of ever getting an authorized
response on this list (see a recent thread about non-responsiveness; subscribers
knowing the facts are here but don’t post anymore).

So this is definitely not the place to vent about that misdesign, but it is
about the way of fixing it now. After having painstakingly caught up on support
of a narrow fixed-width no-break space (U+202F), the industry is now ready to
migrate from U+00A0 to U+202F. Doing it in a single rush is way more
cost-effective than migrating one locale this time, another locale next time, a
handful of locales the time after, possibly splitting them up into sublocales
with different migration schedules. I really believed that, now that Unicode
proves ready to adopt the real group separator in French, all relevant locales
would be consistently pushed to correct that value in release 34. The v34 alpha
overview makes clear they are not.

http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

I aimed at correcting an error in CLDR, not at making French stand out. Having 
many locales and 
sublocales stick with the wrong value makes no sense any more.

https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

The only effect is implementers skipping migration for fr-FR while waiting for 
the others to catch up, 
then doing it for all at once.

There seems to be a misunderstanding: The *locale setting* is whether to use 
period, comma, space, 
apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic.
Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is **not a locale 
setting**,
but it’s all about Unicode *design* and Unicode *implementation.*
I really thought that that was clear and that there’s no need to heavily insist 
on the ST "French" forum.
When referring to the "French thousands separator" I only meant that unlike 
comma- or period-using 
locales, the French locale uses space and that the group separator space should 
be the correct one.
That did **not** mean that French should use *another* space than the other 
locales using space.
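
A minimal sketch (plain Python, not CLDR or ICU code) of the distinction drawn
above: the locale-level choice is “group with spaces”; which space code point
ends up between the triads, U+00A0 or U+202F, is a separate design and
implementation decision.

def group(n: int, separator: str) -> str:
    # Group digits of an integer in threes using the given separator character.
    return f"{n:,}".replace(",", separator)

NBSP  = "\u00A0"  # NO-BREAK SPACE, the value used so far
NNBSP = "\u202F"  # NARROW NO-BREAK SPACE, the narrow fixed-width separator
print(group(1234567, NBSP))   # 1 234 567 with U+00A0 between triads
print(group(1234567, NNBSP))  # 1 234 567 with U+202F between triads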

https://unicode.org/cldr/trac/ticket/11423

Regards,

Marcel



Re: Shortcuts question

2018-09-17 Thread Marcel Schneider via Unicode
On 17/09/18 05:38 Martin J. Dürst wrote:
[quote]
> 
> From my personal experience: A few years ago, installing a Dvorak 
> keyboard (which is what I use every day for typing) didn't remap the 
> control keys, so that Ctrl-C was still on the bottom row of the left 
> hand, and so on. For me, it was really terrible.
> 
> It may not be the same for everybody, but my experience suggests that it 
> may be similar for some others, and that therefore such a mapping should 
> only be voluntary, not default.

Got it, thanks!

Regards,

Marcel



Re: Shortcuts question

2018-09-16 Thread Marcel Schneider via Unicode
On 15/09/18 15:36, Philippe Verdy wrote:
[…]
> So yes all control keys are potentially localisable to work best with the 
> base layout anre remaining mnemonic;
> but the physical key position may be very different.

An additional level of complexity is induced by ergonomics, so that most
non-Latin layouts may wish to stick with QWERTY, and even ergonomic layouts in
the footsteps of August Dvorak rather than Shai Coleman are likely to offer
variants with legacy Virtual Key mapping instead of staying in congruency with
graphics optimized for text input. But again that is easier on Windows, where
VKs are remapped separately, than on Linux, which appears to use graphics
throughout to process application shortcuts, and where only modifiers can be
"preserved" for further processing, with no underlying letter map, which AFAIU
does not exist on Linux.

However, about keyboarding, that may be technically too detailed for this List, 
so that I’ll step out of this thread 
here. Please follow up in parallel thread on CLDR-users instead.

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

Thanks,

Marcel




Re: Shortcuts question (is: Thread transfer info)

2018-09-07 Thread Marcel Schneider via Unicode
Hello,

I’ve followed up on CLDR-users:

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

As a sidenote — it might be hard to get a selection of discussions
to actually happen on CLDR-users instead of the Unicode Public mail list,
as long as subscribers of this list don’t necessarily subscribe to
the other list, too, which still has far fewer subscribers than Unicode Public.

Regards,

Marcel



EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD in YAML))

2018-09-07 Thread Marcel Schneider via Unicode
On 07/09/18 22:07 Eli Zaretskii via Unicode wrote:
> 
> > Date: Fri, 7 Sep 2018 12:47:44 -0700
> > Cc: d3c...@gmail.com, Doug Ewell ,
> > unicode 
> > From: Rebecca Bettencourt via Unicode 
> > 
> > On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode  wrote:
> > 
> > That version has been announced in the Windows 10 Hub several weeks ago.
> > 
> > And it only took them 33 years. :) 
> 
> That's OK, because Unix tools cannot handle Windows end-of-line format
> to this very day. About the only one I know of is Emacs (which
> handles all 3 known EOL formats independently of the platform on which
> it runs, since 20 years ago).

What are you referring to when you say “Unix tools”?
Another text editor—the built-in one of many Linux distributions—Gedit allows
choosing from “Unix/Linux”, “Mac OS Classic”, and “Windows” in the Save dialog.
But in the preferences I cannot find how to default it to either of the latter
two. I’m referring to Ubuntu 16.04.

When on Windows, in Notepad++ I prefer LF over CRLF because it makes for simpler
regexes, and the intermediate mode between regexes and plain search is handier
too. (I use \n in regexes rather than the $ convention.)
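
A small sketch of why the bare \n is simpler: a pattern written against
LF-terminated text needs an extra \r? (or a prior conversion) as soon as the
data is CRLF-terminated.

import re

lf_text   = "alpha\nbeta\n"
crlf_text = "alpha\r\nbeta\r\n"

pattern = re.compile(r"alpha\n")
print(bool(pattern.search(lf_text)))              # True
print(bool(pattern.search(crlf_text)))            # False: the CR gets in the way
print(bool(re.search(r"alpha\r?\n", crlf_text)))  # True: CR tolerated explicitly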

Thanks to Philippe for the Windows 10 news!

Best regards,

Marcel



Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-06 Thread Marcel Schneider via Unicode
On 06/09/18 19:09 Doug Ewell via Unicode wrote:
> 
> Marcel Schneider wrote:
> 
> > BTW what I conjectured about the role of line breaks is true for CSV
> > too, and any file downloaded from UCD on a semicolon separator basis
> > becomes unusable when displayed straight in the built-in text editor
> > of Windows, given Unicode uses Unix EOL.
> 
> It's been well known for decades that Windows Notepad doesn't display
> LF-terminated text files correctly. The solution is to use almost any
> other editor. Notepad++ is free and a great alternative, but there are
> plenty of others (no editor wars, please).
> 
> The RFC Editor site explains why it provides PDF versions of every RFC,
> nearly all of which are plain text:
> 
> "The primary version of every RFC is encoded as an ASCII text file,
> which was once the lingua franca of the computer world. However, users
> of Microsoft Windows often have difficulty displaying vanilla ASCII text
> files with the correct pagination."
> 
> which similarly assumes that "users of Microsoft Windows" have only
> Notepad at their disposal.

Thank you, I’ve got the point.

I’m taking this opportunity to apologize for this post of mine:

https://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0134.html

where I was not joking, but completely off the matter, unable to make sense
of the "Unicode Digest" subject line, which refers to a mail engine feature and
remained unchanged due to limited editing capabilities in a cellphone mailer.
Likewise "unicode-requ...@unicode.org" is used by the engine for that purpose.

My apologies to Doug Ewell, and thanks for your kind reply, taking the pains to
answer while having limited access to e-mail.

Best regards,

Marcel



Re: Shortcuts question

2018-09-06 Thread Marcel Schneider via Unicode
On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> 
> Hello. This may be slightly OT for this list but I'm asking it here as it 
> concerns computer usage with multiple scripts and i18n:

It actually belongs on CLDR-users list. But coming from you, it shall remain 
here while I’m posting a quick answer below.

> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" 
> io Ctrl+A for "all"?

No, Ctrl+A remains Ctrl+A on a French keyboard.

> 2) How about when the shortcuts are the Alt+ combinations referring to 
> underlined letters in actual user visible strings?

I don’t know, but the accelerator shortcuts usually process text input, so it 
would be up to the vendor to keep them in sync.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the 
> other XCV shortcuts) Z key or the Y key
> which is in the physical position of the QWERTY Z key (and close to the other 
> XCV shortcuts)?

On Windows, which this question refers to, virtual keys move around with
graphics on Latin keyboards. While Ctrl+Z on QWERTZ is not handy, I can tell
that it is Ctrl+Z on AZERTY, with the key that has the Z on it and types "z".
The latter is most relevant on Linux, where graphics are used even to process
the Ctrl+ shortcuts.

> 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic 
> or Japanese?

On Windows as they depend on Virtual Keys, they may be laid out on an 
underlying QWERTY basis. The same may apply on macOS, 
where distinct levels are present in the XML keylayout (and likewise in 
system-shipped layouts) to map the letters associated with
shortcuts, regardless of the script. On Linux, shortcuts are reported not to 
work on some non-Latin keyboard layouts (because key names
are based on ISO key positions, and XKB doesn’t appear to use a "Group0" level 
to map the shortcut letters; needs to be investigated).

> 4a) I mean how are they displayed on screen? 

My short answer is: I’ve got no experience; maybe using Latin letters and 
locale labels.

> 4b) Like #1 above, are they changed per language?

Non-Latin scripts typically use QWERTY for ASCII input, so shortcuts may not be 
changed per language.

> 4c) Like #2 above, how about for user visible shortcuts?

Again I’m leaving this over to non-Latin script experts.

> (In India since English is an associate official language, most computer 
> users are at least conversant with basic English
> so we use the English/QWERTY shortcuts even if the keyboard physically shows 
> an Indic script.)

The same applies to virtually any non-Latin locale. Michael Kaplan reported 
that VKs move around only on Latin keyboards.

> Thanks!

You are welcome.

Marcel



Re: CLDR [terminating]

2018-09-04 Thread Marcel Schneider via Unicode
Sorry for not noticing that this thread belongs to CLDR-users, not to Unicode 
Public.
Hence I’m taking it off this list, welcoming participants to follow up there:

https://unicode.org/pipermail/cldr-users/2018-September/000833.html



Re: CLDR

2018-09-03 Thread Marcel Schneider via Unicode
On 03/09/18 09:53 Janusz S. Bień via Unicode wrote:
> 
> On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote:
> > The XML files in these folders:
> >
> > https://unicode.org/repos/cldr/tags/latest/common/
> 
> Thanks for the link.
> 
> In the meantime I rediscovered Locale Explorer
> 
> http://demo.icu-project.org/icu-bin/locexp
> 
> which I used some time ago.

Nice. Actually based on CLDR v31.0.1.

> 
> On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote:
> > On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
> > […]
> >> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> >> > one couldn’t simply pop them into XML or whatever, as the result would 
> >> > be 
> >> > disappointing and call for completion in the aftermath. Yet another task 
> >> > competing with CLDR survey.
> >> 
> >> Please elaborate. It's not clear for me what do you mean.
> >
> > These comments are designed for the Code Charts and as such must not be
> > disproportionate in exhaustivity. Eg we have lists of related languages 
> > ending 
> > in an ellipsis.
> 
> Looks like we have different comments in mind.

Then I’m sorry to be off-topic.

[…]
> >> > and we really 
> >> > need to go through the data and correct the many many errors, please.
> 
> But who is the right person or institution to do it?

Software vendors are committed to caring for the data, and may delegate survey 
work to service providers specialized in localization. Then I think that public 
language 
offices should be among the reviewers. Beyond that, and especially in the absence 
of the latter, anybody is welcome to contribute as a guest. (Guest votes count as 1 
and don’t
add up.) That is consistent with the fact that Unicode relies on 
volunteers, too.

I’m volunteering to personally welcome you to contribute to CLDR.

[…]
> > Further you will see that while Polish is using apostrophe
> > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
> > CLDR does not have the correct apostrophe for Polish, as opposed eg to 
> > French.
> 
> I understand that by "the correct apostrophe" you mean U+2019 RIGHT
> SINGLE QUOTATION MARK.

Yes.

> 
> > You may wish to note that from now on, both U+0027 APOSTROPHE and 
> > U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
> > preferred characters in publishing are U+2019 and, for Polish, the U+201E 
> > and 
> > U+201D that are already found in CLDR pl.
> 
> The situation seems more complicated because the chart
> 
> https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html
> 
> contains different list of punctuation characters than
> 
> https://www.unicode.org/cldr/charts/34/summary/pl.html.
> 
> I guess the latter is the primary one, and it contains U+2019 RIGHT
> SINGLE QUOTATION MARK (and U+0x2018 LEFT SINGLE QUOTATION MARK, too).

It’s a bit confusing because there is a column for English and a column for 
Polish.
The characters you retrieved are actually in the English column, while Polish 
has, 
consistently with By-Type, these quotation marks:
' " ” „ « » 
Hence the set is incomplete.

> 
> >
> > Note however that according to the information provided by English 
> > Wikipedia:
> > https://en.wikipedia.org/wiki/Quotation_mark#Polish
> > Polish also uses single quotes, that by contrast are still missing in CLDR.
> 
> You are right, but who cares? Looks like this has no practical
> importance. Nobody complains about the wrong use of quotation marks in
> Polish by Word or OpenOffice, so looks like the software doesn't use
> this information. So this is rather a matter of aesthetics...

I’ve come to the position that letting a word processor “use” quotation marks
is to miss the point. Quotation marks are definitely used by the user typing
in his or her text, and are expected to be on the keyboard layout he or she
is using. So-called smart quotes, guessed algorithmically from the ASCII single 
and double quotes, are but a hazardous workaround for not installing the 
appropriate keyboard layout. At least that is my position :)

Best regards,

Marcel



Re: UCD in XML or in CSV? (is: UCD data consumption)

2018-09-01 Thread Marcel Schneider via Unicode
I’m not responding without thinking, as I was blamed for doing once,
but it is painful for me to dig into what Ken explained about how 
we should be consuming UCD data. I’ll now try to bring some more clarity
to the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> […]
> > 
> > Third, please remember that folks who come here complaining about the 
> > complications of parsing the UCD are a very small percentage of a very 
> > small percentage of a very small percentage of interested parties. 

OK, among avg. 700 list subscribers, relatively few are ever complaining 
about anything, let alone about this particular topic. But we should always 
keep in mind that many folks out there complaining about Unicode don’t come 
here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a 
> > secondary source (e.g. for reference via codepoints.net), or as a 
> > tertiary source (behind specialized API's, regex, etc.),

As already suggested, “as” should probably read “via” in that part.

> > or as an end 
> > user (just getting behavior they expect for characters in applications). 

That is more than a simple statement about who consumes UCD data in which way, 
since you say “should.” There seem to be assumptions that it is discouraged 
to dive into the raw data; that folks reading file headers are not doing well;
that the data should be assembled only in certain ways; and that ignorant 
people shouldn’t open the UCD cupboard to pick a file they deem useful.
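To make the “tertiary source” route concrete, here is a minimal sketch (my own, 
not anything prescribed by the UCD) of consuming a few properties behind an API, 
using Python’s standard unicodedata module; the sample characters are arbitrary.

    import unicodedata

    # Query a few UCD properties through the API instead of the raw files.
    for ch in ("A", "\u00C4", "\u2212"):          # A, Ä, MINUS SIGN
        print(
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, ""),             # character name
            unicodedata.category(ch),             # General_Category, e.g. Lu, Sm
            unicodedata.mirrored(ch),             # 1 if Bidi_Mirrored, else 0
        )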

If so, then it might be surprising to know that when I submitted a proposal
with feedback on bidi-mirroring issues of mathematical symbols,
http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html
I had started as a quasi-end-user not getting the behavior I expected for characters 
in browsers: I was checking characters bidi-mirrored by glyph exchange, as it is
implemented in web browsers, because I wanted end-users to be able to experience 
bidi-mirroring as it works. Unexpectedly, a number of math symbols 
did not mirror, even though many of them are scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and 
> > write parsers for them directly should actually be able to deal with the 
> > format complexity -- and, if anything, slowing them down to make them 
> > think about the reasons for the format complexity might be a good thing, 

I can see one main reason for the format complexity, and that is that data 
from the various properties don’t necessarily telescope the same way to make for 
small files. The complexity of the UCD would then mainly be self-induced by the
choice of packing data into one small file per property rather than adding the
value to each relevant code point in one large list such as UnicodeData.txt.

While I’m now taking the time to write this up because I’m committed to 
processing that information, we can think of many, many people who don’t like 
to be slowed down trying to find out why Unicode changed the UCD design, when 
following the original idea of a large CSV list would have been straightforward, 
if need be by setting up a new one once the first got stuck. What I can 
figure out is that whenever a new property was added, that particular property 
was always thought of as being the last one. 
(At some point the many files were then dumped into the known XML files.)

If the UCD is to be made of small files, it is necessarily complex, and the 
conclusion is that there should be another large CSV grid to make things 
simple again, and as lightweight as they can be.
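To illustrate how little machinery the large-list approach needs, here is a 
minimal sketch in Python that reads UnicodeData.txt with a generic CSV reader; 
the field labels are my own shorthand for the fifteen documented fields, not 
part of the file.

    import csv

    # My own shorthand labels for the 15 fields of UnicodeData.txt.
    FIELDS = ["cp", "name", "gc", "ccc", "bc", "decomp", "dec", "digit",
              "num", "bidi_m", "na1", "isc", "suc", "slc", "stc"]

    with open("UnicodeData.txt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter=";"):
            record = dict(zip(FIELDS, row))
            if record["gc"] == "Zs":              # e.g. list the space separators
                print(record["cp"], record["name"])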

> > as it tends to put the lie to the easy initial assumption that the UCD 
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with “simple” taken off? It no longer appears to me as 
a lie then. One attribute comes to mind that is so complex that its 
design even changed over time, despite Unicode’s commitment to stability.
The Bidi_Mirrored_Glyph property was originally designed to include “best-fit”
pairs for least-bad display in applications not supporting RTL glyphs 
(i.e. without OpenType support), with the legibility of math formulae in mind.
Later (probably due to a poorly written OpenType spec), no more best-fit pairs 
were added to BidiMirroring.txt, as if OpenType implementers weren’t to remove
the best-fit pairs anyway prior to using the file (while the spec says to use 
it as-is). That then led to the display problem pointed out above.

I’m leaving aside the particular problem related to 3 pairs of symbols with tilde, 
as well as the missing Bidi_Mirroring_Type property, given that UTC was not interested. 

So you can understand that I’m not unaware of the complexity of UCD. Though
I don’t think that this could be an argument for not publishing a medium-size 
CSV file with scalar values listed as in UnicodeData.txt.

> 
> […]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

Ie not every spreadsheet 

Re: UCD in XML or in CSV? (is: Parsing UCD in XML)

2018-09-01 Thread Marcel Schneider via Unicode
On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
> 
> To handle the UCD XML file a streaming parser like Expat is necessary.

Thanks for the tip. However, for my needs, Expat looks like overkill, and I’m 
looking for a much simpler standalone tool that just converts XML to CSV.
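For what it is worth, such a conversion does not need much beyond a standard 
library; the following sketch (mine, with an arbitrary selection of attributes 
and output file name) streams the flat UCD XML in Python and writes one 
semicolon-separated row per char element.

    import csv
    import xml.etree.ElementTree as ET

    NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
    ATTRS = ["cp", "na", "gc", "ccc", "bc", "Bidi_M", "sc", "blk"]   # arbitrary pick

    with open("ucd.csv", "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(ATTRS)                          # header line
        # iterparse streams the document, so even the large flat XML stays manageable
        for _, elem in ET.iterparse("ucd.nounihan.flat.xml"):
            if elem.tag == NS + "char":
                writer.writerow([elem.get(a, "") for a in ATTRS])
                elem.clear()                            # release the parsed element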

> 
> For codepoints.net I use that data […]

Very good site IMO, as it compiles a lot of useful information trying to 
maximize 
human readability. 

Nice to have added the Adopt-a-character button, too.

Thanks,

Marcel



Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-01 Thread Marcel Schneider via Unicode
Thank you Marius for the example. Indeed I now see that YAML is a powerful means
for a file to have an intuitive readability while drastically reducing file 
size.

BTW what I conjectured about the role of line breaks is true for CSV too: any 
semicolon-separated file downloaded from the UCD becomes unusable when 
displayed straight in the built-in text editor of Windows, given Unicode uses 
Unix EOL.

Still, for use in spreadsheets, YAML needs to be converted to CSV, although that 
might not crash the browser the way large XML does.
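Should anyone want to try that route, here is a rough sketch of such a conversion 
(assuming PyYAML is installed and a hypothetical ucd.yaml laid out like the example 
quoted below); the YAML loader resolves the references, so the shared property 
block comes back in full for every character.

    import csv
    import yaml            # PyYAML, third-party

    with open("ucd.yaml", encoding="utf-8") as f:
        chars = yaml.safe_load(f)["repertoire"]["char"]

    # Header taken from the first record; note that with YAML 1.1 loaders,
    # bare Y/N values parse as booleans rather than as the strings "Y"/"N".
    fields = ["cp"] + sorted(chars[0]["props"])

    with open("ucd-from-yaml.csv", "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(fields)
        for ch in chars:
            writer.writerow([ch.get("cp", "")]
                            + [ch["props"].get(k, "") for k in fields[1:]])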

Regards,

Marcel

On 01/09/18 09:18 Marius Spix via Unicode wrote:
> 
> Hello Marcel,
> 
> YAML supports references, so you can refer to another character’s
> properties.
> 
> Example:
> 
> repertoire: 
> char:
> -
> name_alias: 
> - [NUL,abbreviation]
> - ["NULL",control]
> cp: 
> na1: "NULL"
> props: &
> age: "1.1"
> na: ""
> JSN: ""
> gc: Cc
> ccc: 0
> dt: none
> dm: "#"
> nt: None
> nv: NaN
> bc: BN
> bpt: n
> bpb: "#"
> Bidi_M: N
> bmg: ""
> suc: "#"
> slc: "#"
> stc: "#"
> uc: "#"
> lc: "#"
> tc: "#"
> scf: "#"
> cf: "#"
> jt: U
> jg: No_Joining_Group
> ea: N
> lb: CM
> sc: Zyyy
> scx: Zyyy
> Dash: N
> WSpace: N
> Hyphen: N
> QMark: N
> Radical: N
> Ideo: N
> UIdeo: N
> IDSB: N
> IDST: N
> hst: NA
> DI: N
> ODI: N
> Alpha: N
> OAlpha: N
> Upper: N
> OUpper: N
> Lower: N
> OLower: N
> Math: N
> OMath: N
> Hex: N
> AHex: N
> NChar: N
> VS: N
> Bidi_C: N
> Join_C: N
> Gr_Base: N
> Gr_Ext: N
> OGr_Ext: N
> Gr_Link: N
> STerm: N
> Ext: N
> Term: N
> Dia: N
> Dep: N
> IDS: N
> OIDS: N
> XIDS: N
> IDC: N
> OIDC: N
> XIDC: N
> SD: N
> LOE: N
> Pat_WS: N
> Pat_Syn: N
> GCB: CN
> WB: XX
> SB: XX
> CE: N
> Comp_Ex: N
> NFC_QC: Y
> NFD_QC: Y
> NFKC_QC: Y
> NFKD_QC: Y
> XO_NFC: N
> XO_NFD: N
> XO_NFKC: N
> XO_NFKD: N
> FC_NFKC: "#"
> CI: N
> Cased: N
> CWCF: N
> CWCM: N
> CWKCF: N
> CWL: N
> CWT: N
> CWU: N
> NFKC_CF: "#"
> InSC: Other
> InPC: NA
> PCM: N
> blk: ASCII
> isc: ""
> 
> -
> cp: 0001
> na1: "START OF HEADING"
> name_alias: 
> - [SOH,abbreviation]
> - [START OF HEADING,control]
> props: *
> 
> 
> 
> 
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
> schrieb Marcel Schneider wrote:
> 
[…]



Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-09-01 Thread Marcel Schneider via Unicode
On 31/08/18 08:25 Marius Spix via Unicode wrote:
> 
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
> 
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.

Thanks for the advice. I already use YAML syntax highlighting to display 
XCompose files, which use the colon as a separator, too.

Did you figure out how YAML would fit the UCD data? It appears to rely heavily
on line breaks, which may get lost as data moves around across environments.
XML indentation is only a readability feature and irrelevant to content. The 
structure is independent of invisible characters and is stable as long as the 
visible characters are not corrupted (though it may happen that they are). Line 
breaks are odd in that they are inconsistent across OSes, because Unicode was 
denied the right to impose a unique standard in that matter. The result is 
mashed-up files, and I fear YAML might not hold out.

Like XML, YAML needs to repeat attribute names in every instance. That 
is precisely what CSV gets around, at the expense of readability in 
plain text. Personally I could use YAML as I use XML, for lookup in
the text editor, but I’m afraid there is no advantage over CSV with
respect to file size.

Regards,

Marcel
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
> 
[…]



Re: UCD in XML or in CSV?

2018-08-31 Thread Marcel Schneider via Unicode
On 31/08/18 19:59 Ken Whistler via Unicode wrote:
[…]
> Second, one of the main obligations of a standards organization is 
> *stability*. People may well object to the ad hoc nature of the UCD data 
> files that have been added over the years -- but it is a *stable* 
> ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
> tweaking formats of data files to meet complaints about one particular 
> parsing inconvenience or another. That would create multiple points of 
> discontinuity between versions -- worse than just having to deal with 
> the ongoing growth in the number of assigned characters and the 
> occasional addition of new data files and properties to the UCD.

I did not want to make trouble by asking to move conventions back and forth.
I would like to learn why UnicodeData.txt was released as a draft without a header 
or anything, given that Unicode knew well in advance that the scheme adopted 
at first release would be kept stable for decades or forever. 

Then I’d like to learn how Unicode came not to devise a consistent scheme
for all the UCD files, if any such scheme could be devised, so that people could 
assess whether complaints about inconsistencies are well-founded
or not. It is not enough for me that a given ad-hockery is stable; IMO it should 
also be well-designed, as one expects from a standards body responsive to history.
That is not what is said of UnicodeData.txt, although it is the only 
file in the UCD effectively formatted for streamlined processing. Was there not 
enough time to think about a header line and a file header? With a header 
line it would be flexible, and all the problems would be solved by specifying 
that parsers should start by counting the fields before creating 
storage arrays. We are lacking a real history of Unicode explaining why 
everybody was in a hurry. “Authors falling like flies” is the only hint that 
comes to mind.
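To make that concrete: a parser that begins by counting the fields from a header 
line, and skips crosshatch comments along the way, takes only a few lines. A 
minimal sketch in Python, assuming a semicolon-delimited file whose first 
non-comment line names the fields:

    import csv

    def read_headed_file(path, delimiter=";"):
        """Yield one dict per record, sized from the header line, not hard-wired."""
        with open(path, encoding="utf-8", newline="") as f:
            lines = (ln for ln in f if ln.strip() and not ln.startswith("#"))
            reader = csv.reader(lines, delimiter=delimiter)
            header = [name.strip() for name in next(reader)]   # count fields here
            for row in reader:
                yield dict(zip(header, (value.strip() for value in row)))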

And given that Unicode appears to have missed that opportunity, I’d like to discuss 
whether it would be time to add a more accomplished file for better usability.

> 
> Keep in mind that there is more to processing the UCD than just 
> "latest". People who just focus on grabbing the very latest version of 
> the UCD and updating whatever application they have are missing half the 
> problem. There are multiple tools out there that parse and use multiple 
> *versions* of the UCD. That includes the tooling that is used to 
> maintain the UCD (which parses *all* versions), and the tooling that 
> creates UCD in XML, which also parses all versions. Then there is 
> tooling like unibook, to produce code charts, which also has to adapt to 
> multiple versions, and bidi reference code, which also reads multiple 
> versions of UCD data files. Those are just examples I know off the top 
> of my head. I am sure there are many other instances out there that fit 
> this profile. And none of the applications already built to handle 
> multiple versions would welcome having to permanently build in tracking 
> particular format anomalies between specific versions of the UCD.

That point is clear to me, and even when suggesting changes to
BidiMirroring.txt, I offered alternatives keeping the existing file stable and adding 
a new enhanced file. But what is totally unclear to me is what role old versions 
play in compiling the latest data. Deltas are OK, research on a particular topic in
old data is OK, but why would one need to parse *all* versions to 
produce the newest products?
> 
> Third, please remember that folks who come here complaining about the 
> complications of parsing the UCD are a very small percentage of a very 
> small percentage of a very small percentage of interested parties. 
> Nearly everybody who needs UCD data should be consuming it as a 
> secondary source (e.g. for reference via codepoints.net), or as a 
> tertiary source (behind specialized API's, regex, etc.), or as an end 
> user (just getting behavior they expect for characters in applications). 
> Programmers who actually *need* to consume the raw UCD data files and 
> write parsers for them directly should actually be able to deal with the 
> format complexity -- and, if anything, slowing them down to make them 
> think about the reasons for the format complexity might be a good thing, 
> as it tends to put the lie to the easy initial assumption that the UCD 
> is nothing more than a bunch of simple attributes for all the code points.

That makes no sense to me. UCD raw data is and remains a primary source;
I see no way to consume it as a secondary source or as a tertiary source. 
Do you mean to consume it via secondary or tertiary sources? Then we 
actually appear to consume those sources instead of the UCD raw data.
These sources are fine for the purpose of getting information about particular 
code points, but most of the tools I remember don’t allow filtering values and 
computing overviews, nor adding data, as we can do 
in spreadsheet software. Honestly are we so few people 

Re: CLDR (was: Private Use areas)

2018-08-31 Thread Marcel Schneider via Unicode
On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
[…]
> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> > one couldn’t simply pop them into XML or whatever, as the result would be 
> > disappointing and call for completion in the aftermath. Yet another task 
> > competing with CLDR survey.
> 
> Please elaborate. It's not clear for me what do you mean.

These comments are designed for the Code Charts and as such must not be
disproportionately exhaustive. E.g. we have lists of related languages ending 
in an ellipsis. Once this is popped into XML, i.e. extracted from NamesList.txt
to be fed into an extensible and unconstrained format (without any constraint 
on available space, number and length of comments, and so on), any gap 
is felt as a discriminating neglect, and there will be a huge rush to add data.
Yet Unicode hasn’t set up products where that data could be published: not 
in the Code Charts (for the above-mentioned reason), not in ICU insofar as the 
additional information involved does not match a known demand on the user side 
(localizing software does not mean providing scholarly exhaustive information
about supported characters). The use would be in character pickers providing 
all available information about a given character. That is why Unicode should
prioritize CLDR for CLDR users, rather than extra information for the web.

> 
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
> 
> Which XML? where?

More precisely it is LDML, the CLDR-specific XML.
What I called “digest charts” are the charts found here:

http://www.unicode.org/cldr/charts/34/

The access is via this page:

http://cldr.unicode.org/index/downloads

where the charts are in the Charts column, while the raw data is under SVN Tag.

> 
> > and we really 
> > need to go through the data and correct the many many errors, please.
> 
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.

I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive 
for the access to the XML data (except when knowing about SubVersioN).
Polish data is found here:

https://www.unicode.org/cldr/charts/34/summary/pl.html

The access is via the top of the "Summary" index page (showing root data):

https://www.unicode.org/cldr/charts/34/summary/root.html

You may wish to particularly check the By-Type charts:

https://www.unicode.org/cldr/charts/34/by_type/index.html

Here I’d suggest to first focus on alphabetic information and on punctuation.

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

Under Latin (a table caption, without an anchor) we find out what punctuation 
Polish has compared to other locales using the same script.
The exact character appears when hovering over the header row.
E.g. U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
an error in almost every locale using the hyphen. The TC is about to correct that.

Further you will see that while Polish is using apostrophe
https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
CLDR does not have the correct apostrophe for Polish, as opposed eg to French.
You may wish to note that from now on, both U+0027 APOSTROPHE and 
U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
U+201D that are already found in CLDR pl.

Note however that according to the information provided by English Wikipedia:
https://en.wikipedia.org/wiki/Quotation_mark#Polish
Polish also uses single quotes, that by contrast are still missing in CLDR.

Now you might understand what I meant when pointing that there are still 
many errors in many languages in CLDR, including in English.

Best regards,

Marcel

> 
> Best regards
> 
> Janusz
> 
> -- 
> , 
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
> 
>



UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-08-30 Thread Marcel Schneider via Unicode
nicodeData.txt 
that I can see zero technical reasons not to add it. Processes using the file 
to generate 
an overview of Unicode also use other files and are thus able to process 
comments correctly, 
whereas those processes using UnicodeData to look up character properties 
provided in the file 
would start by searching for the code point. (Perhaps there are compilers building 
DLLs from the file.)

On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote:
> 
> UnicodeData.txt was devised long before any of the other UCD data files. Though 
> it might seem like a simple enhancement to us, adding a header block, or even a 
> single line, would break a lot of existing processes that were built long ago 
> to parse this file.
> 
> So Unicode can't add a header to this file, and that is the reason the format 
> can never be changed (e.g. with more columns). That is why new files keep 
> getting created instead.
> 
> The XML format could indeed be expanded with more attributes and more 
> subsections. Any process that can parse XML can handle unknown stuff like this 
> without misinterpreting the stuff it does know.
> 
> That's why the only two reasonable options for getting UCD data are to read all 
> the tab- and semicolon-delimited files, and be ready for new files, or just 
> read the XML. Asking for changes to existing UCD file formats is kind of a 
> non-starter, given these two alternatives.
> 
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 
>  Original message 
> Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
> From: Marcel Schneider via Unicode 
> 
> Curiously, UnicodeData.txt is lacking the header line. That makes it unflexible.
> I never wondered why the header line is missing, probably because compared
> to the other UCD files, the file looks really odd without a file header showing 
> at least the version number and datestamp. It’s like the file was made up for 
> dumb parsers unable to handle comment delimiters, and never to be upgraded
> to do so.
> 
> But I like the format, and that’s why at some point I submitted feedback asking 
> for an extension. [...]








Re: Unicode Digest, Vol 56, Issue 20

2018-08-30 Thread Marcel Schneider via Unicode
Thank you for looking into this. First, I’m unable to retrieve the publication 
you are citing, 
but a February thread had nearly the same subject, referring to Vol. 50. How 
did you 
compute these figures? Is that a code phrase for: “The same questions over 
and 
over again; let’s settle this on the record, as a reference for later 
inquiries”?

Also, "unicode-requ...@unicode.org" doesn’t appear to seem to be a valid e-mail 
address.
That would mean that I’d better send a proposal with an enhancement request to
docsub...@unicode.org, rather than contribute to the topic while it is being 
discussed 
on the Unicode Public Mail List?

OK I’ll try to get something out of this, because many people really want 
things to grow 
better:

On 30/08/18 20:37 Doug Ewell via Unicode wrote:
> 
> UnicodeData.txt was devised long before any of the other UCD data files.

I can’t think of any era in the computer age when file headers were uncommon, 
and when a parser able to process semicolons couldn’t be directed to make sense
of crosshatches. If releasing a headerless file was ever a mistake, 
implementers 
would have been able to anticipate that it might be corrected at some point. 
Implementations 
are to be updated at every single Unicode release, as far as I can tell, while 
ignoring the arcana of frozen APIs.

> Though it might seem like a simple enhancement to us, adding a header block, 
> or even a single line,
> would break a lot of existing processes that were built long ago to parse 
> this file.

They are hopelessly outdated anyway, and most of them would have been replaced 
with something 
better long ago. The remainder might not be worth bothering the rest 
of the world with 
headerless files.

> So Unicode can't add a header to this file, and that is the reason the format 
> can never be changed
> (e.g. with more columns). That is why new files keep getting created instead.

I had figured out something like that rationale, and I can also understand that 
Unicode, having released headerless files for so long, isn’t going to suddenly 
add the missing header just because some guy tells them to. Also, I didn’t really 
ask for that, but 
suggested adding 
yet another *new* file, not changing the data structure of the existing 
UnicodeData.txt. 

As for the reference, a Google search for "unicodedataextended.txt" just brought 
it up:
http://www.unicode.org/review/pri297/

Having said that, I still think that while not parsing a header line is a 
reasonable position if the field structure is known to be stable, not being 
able to *skip* 
one is sort of odd.

> The XML format could indeed be expanded with more attributes and more 
> subsections.
> Any process that can parse XML can handle unknown stuff like this without 
> misinterpreting
> the stuff it does know.

Agreed. I’m not questioning XML. But I’m using spreadsheets. I don’t know how 
many computer
scientists do use spreadsheets. Perhaps we’re not many looking up 
UnicodeData.txt that way
(I use it in raw text, too, and I look up ucd.nounihan.flat.xml). Generating 
code in a 
spreadsheet is considered quick-and-dirty. I don’t agree it’s dirty, but it’s 
quick.

And above all, it appears that doing certain research in spreadsheets is the 
most efficient 
way to check whether character properties match character identity. 
Using spreadsheet 
software is trivial, so it might be looked down on and left to non-scientists, 
while it is 
closer to human experience and allows one to do research in nearly no time, by 
adding columns, 
filters and formulae that one would otherwise probably spend weeks coding in C, 
Lisp, 
Perl or Python (which I cannot do, so I’m biased).

> That's why the only two reasonable options for getting UCD data are to read 
> all the tab- and semicolon-delimited files,
> and be ready for new files, or just read the XML. Asking for changes to 
> existing UCD file formats is kind of a non-starter,
> given these two alternatives.

Given the above, one can easily understand why I do not agree with being 
limited to these two
alternatives. 

Given that a process must be able to be updated just to grab a newly added 
small file 
from the UCD, it can as well be updated to skip file comments, and 
even 
to parse a new *large* file from the UCD.

On the other hand, given that Unicode is ready to add new small semicolon-delimited 
files, 
it might as well add a new *large* semicolon-delimited file to the 
UCD.
That large file would have a file header and a header line, and be specified as 
being flexible.
That file might have one hundred fields delimited by 99 semicolons. These 5 
million semicolons 
would still be more lightweight than 5 million attribute names plus the XML 
tags.

The added value is that people using spreadsheets would have a handy file to import, 
rather than 
each individual having to convert a large XML file to a large CSV file, for lack 
of the latter
being readily 

Re: Private Use areas

2018-08-29 Thread Marcel Schneider via Unicode
On 29/08/18 07:55, Janusz S. Bień via Unicode wrote:
> 
> On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes:
> > On August 23, 2011, Asmus Freytag wrote:
> >
> >> On 8/23/2011 7:22 AM, Doug Ewell wrote:
> >>> Of all applications, a word processor or DTP application would want
> >>> to know more about the properties of characters than just whether
> >>> they are RTL. Line breaking, word breaking, and case mapping come to
> >>> mind.
> >>>
> >>> I would think the format used by standard UCD files, or the XML
> >>> equivalent, would be preferable to making one up:
[…]
> >>
> >> The right answer would follow the XML format of the UCD.
> >>
> >> That's the only format that allows all necessary information contained
> >> in one file,
> 
> For me necessary are also comments and crossreferences contained in
> NamesList.txt. Do I understand correctly that only "ISO Comment
> properties" are included in the file?

Even that comment field is obsoleted. But it’s unclear to me what exactly 
it was providing from ISO.

> 
> >> and it would leverage of any effort that users of the
> >> main UCD have made in parsing the XML format.
> >>
> >> An XML format shold also be flexible in that you can add/remove not
> >> just characters, but properties as needed.
> >>
> >> The worst thing do do, other than designing something from scratch,
> >> would be to replicate the UnicodeData.txt layout with its random, but
> >> fixed collection of properties and insanely many semi-colons. None of
> >> the existing UCD txt files carries all the needed data in a single
> >> file.

Curiously, UnicodeData.txt is lacking the header line. That makes it unflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It’s like the file was made up for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that’s why at some point I submitted feedback asking 
for an extension. Indeed we could use more information than what is yielded 
by UCD \setminus NamesList.txt (that we may not parse, as per file header).
Given NamesList.txt / Code Charts comments are kept minimal by design, 
one couldn’t simply pop them into XML or whatever, as the result would be 
disappointing and call for completion in the aftermath. Yet another task 
competing with CLDR survey. Reviewing CLDR data is IMO top priority.
There are many flaws to be fixed in many languages including in English.
A lot of useful digest charts are extracted from XML there, and we really 
need to go through the data and correct the many many errors, please.

Unlike with XML, the human readability of CSV may not be immediate. Yes, you simply 
cannot always count the semicolons and recall the property name from 
the value position if it isn’t obvious by itself. But we use spreadsheets. At 
least 
some people do. That’s where the magic works. 

Looking up things in a spreadsheet is a good way to find out about wrong 
property values. Looks like handling files only programmatically gets
everything screwed up.
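For completeness, the single-column overview meant here — each distinct value of 
a property and how often it occurs — takes only a few lines even outside a 
spreadsheet; a small sketch of mine over the General_Category field of 
UnicodeData.txt:

    from collections import Counter

    # Tally the General_Category values (third field of UnicodeData.txt),
    # much like a spreadsheet filter or pivot table would.
    counts = Counter()
    with open("UnicodeData.txt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(";")
            counts[fields[2]] += 1

    for gc, n in counts.most_common():
        print(gc, n)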


Marcel



Re: Diacritic marks in parentheses

2018-07-26 Thread Marcel Schneider via Unicode
Indeed, when the target use is general, dialectological diacritics are visibly not 
an option: 
despite being in Unicode since v7.0 (2014), they are still unsupported by 
mainstream software.
Writing “der Arzt oder die Ärztin” or, depending on context, “einen Arzt oder 
eine Ärztin”, 
which I remember being common on package leaflets, is best practice.

Mit freundlichen Grüßen,

Marcel

On 26/07/18 18:27, Markus Scherer wrote:
> 
> I would not expect for Ä+combining () above = Ä᪻ to look right except with 
> specialized fonts.
> http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0
>
> Even if it worked widely, I think it would be confusing.
> I think you are best off writing Arzt/Ärztin.
>
>
> Viele Grüße,
> markus




Re: Diacritic marks in parentheses

2018-07-26 Thread Marcel Schneider via Unicode
We do have this already, in combining marks extended:
 
@@ 1AB0 Combining Diacritical Marks Extended 1AFF
@ Used for German dialectology
[…]
1ABB COMBINING PARENTHESES ABOVE
* intended to surround a diacritic above
1ABC COMBINING DOUBLE PARENTHESES ABOVE
1ABD COMBINING PARENTHESES BELOW
* intended to surround a diacritic below
1ABE COMBINING PARENTHESES OVERLAY
* intended to surround a base letter
* exact placement is font dependent

 
Best regards,
Marcel
 
On 26/07/18 12:49 Christoph Päper via Unicode wrote:
> 
> German umlauts often occur when a noun is plural or an agens noun is female, 
> e.g. _Arzt_ '(male) physician', _Ärzte_ 'physicians' and _Ärztin_ 
'female physician'. There are several cases where a short notation for both 
singular and plural or, more frequently, male and female singular are 
desired. A number of notations are commonly encountered, e.g. (not showing 
number pairs) _Doktor(in)_, _Doktor/-in_, _Doktor/in_, _DoktorIn_, 
_Doktor_in_, _Doktor*in_. 
> 
> These only[^1] work well if there is no umlaut difference, i.e. neither 
> _Ärzt/-in_ nor _Arzt/-in_ would be appropriate. A way to show the umlaut 
dots are conditional would be required but is not available in plain text 
systems and complicated to achieve in most rich text systems. Unicode has '⸚' 
HYPHEN WITH DIAERESIS (U+2E1A) to offer, i.e. _Arzt⸚in_ or _Arzt/⸚in_. This is 
also very uncommon, but may be used in some linguistic texts. 
> 
> I believe the most intuitive solution would be tiny parentheses before and 
> after the two dots. This has no established usage as far as I am aware of, 
so would probably not qualify for encoding in the Unicode Standard. However, if 
it would qualify nevertheless, should this be a new atomic diacritic 
mark, e.g. COMBINING PARENTHESIZED DIAERESIS ABOVE, or two characters, e.g. 
COMBINING OPEN PARENTHESES ABOVE and COMBINING 
CLOSE PARENTHESES ABOVE to be used with COMBINING DIAERESIS (U+0308)?
> 
> [^1] Yes, there are other cases where the stem changes in different ways, but 
> that is irrelevant here.
> 
>



Please disregard my mistaken e-mail

2018-06-13 Thread Marcel Schneider via Unicode
My last e-mail, with the subject “re: Your message to Unicode awaits moderator 
approval”, 
was mistakenly sent to the mailing list, because I forgot to remove the address 
in the Cc field (end hidden).
Please disregard.
My apologies.
Marcel



re: Your message to Unicode awaits moderator approval

2018-06-13 Thread Marcel Schneider via Unicode
> Message du 13/06/18 22:25
> De : "via Unicode" 
> A : charupd...@orange.fr
> Copie à : 
> Objet : Your message to Unicode awaits moderator approval
> 
> Your mail to 'Unicode' with the subject
> 
> Re: The Unicode Standard and ISO
> 
> Is being held until the list moderator can review it for approval.
> 
> The reason it is being held:
> 
> Post to moderated list
> 
> Either the message will get posted to the list, or you will receive
> notification of the moderator's decision. If you would like to cancel
> this posting, please visit the following URL:
> 
> http://unicode.org/mailman/confirm/unicode/07224dbb3f89488430be25c396d1590baa55c022
> 
I’m unable to decide whether I should cancel this myself or do nothing. If 
there is no use in posting, 
so much the better. Anyway I’ve nothing more to tell on any list, as UTC isn’t 
interested in fixing the bidi 
legibility issue I’ve pointed out, and probably won’t be interested in 
deprecating U+2010 so as not to 
mislead font designers. Additionally, some people post false allegations at 
my expense and start 
getting insulting, confusing Unicode Public with a WG2 meeting in the nineties, 
even though all of that 
had been discussed off-list last year.

On Tue, 12 Jun 2018 19:49:10 +0200, Mark Davis ☕️ via Unicode wrote:
[…]
> People interested in this topic should 
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

Thank you for letting us know. I apologize for my e-mailing. I didn’t respond 
right away, for a variety of 
reasons, while of course immediately and fully agreeing; I had mainly wondered 
why I got no feedback when 
I last terminated a thread that was heading the same way, but that no longer matters.
No problem: as far as it depends on me, this topic will never be read again 
here nor elsewhere.

Sorry again.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-13 Thread Marcel Schneider via Unicode
On Tue, 12 Jun 2018 19:49:10 +0200, Mark Davis ☕️ via Unicode wrote:
[…]
> People interested in this topic should 
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

Thank you for letting us know. I apologize for my e-mailing. I didn’t respond 
right away, for a variety of 
reasons, while of course immediately and fully agreeing; I had mainly wondered 
why I got no feedback when 
I last terminated a thread that was heading the same way, but that no longer matters.
No problem: as far as it depends on me, this topic will never be read again 
here nor elsewhere.

Sorry again.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-12 Thread Marcel Schneider via Unicode
On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> 
> Marcel,
> 
> You have put words into my mouth. Please don’t. Your description of what I 
> said is NOT accurate. 
> 
> > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
> > 
> > And in this thread I wanted to demonstrate that by focusing on the wrong 
> > priorities, i.e. legacy character names instead of
> > the practicability of on-going encoding and the accurateness of specified 
> > decompositions—so that in some instances cedilla
> > was used instead of comma below, Michael pointed out—, ISO/IEC JTC1 SC2/WG2 
> > failed to do its part and missed its mission—
> > and thus didn’t inspire a desire of extensive cooperation (and damaged the 
> > reputation of the whole ISO/IEC).

Michael, I’d better quote your actual e-mail:

On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
[…]
> Many things have more than one name. The only truly bad misnomers from that 
> period was related to a mapping error,
> namely, in the treatment of Latvian characters which are called CEDILLA 
> rather than COMMA BELOW. 

Now I fail to understand why this mustn’t be reworded to “the accurateness of 
specified decompositions—so that in some instances 
cedilla was used instead of comma below[.]”
If any correction can be made, I’d be eager to take note.
Thanks for correcting.

Now let’s append the e-mail that I was about to send:

Another ISO Standard that needs to be mentioned in this thread is ISO 15924 
(script codes; not ISO/IEC).
It has a particular status in that Unicode is the Registration Authority. 

I wonder whether people agree that it has a French version. Actually it does 
have a French version, but 
Michael Everson (Registrar) revealed on this List multiple issues with synching 
French script names in 
ISO 15924-fr and in Code Charts translations.

Shouldn’t this content be moved to CLDR? At least with respect to localized 
script names.



Re: The Unicode Standard and ISO

2018-06-12 Thread Marcel Schneider via Unicode


William,

On 12/06/18 12:26, William_J_G Overington wrote:
> 
> Hi Marcel
> 
> > I don’t fully disagree with Asmus, as I suggested to make available 
> > localizable (and effectively localized) libraries of message components, 
> > rather than of entire messages.
> 
> Could you possibly give some examples of the message components to which you 
> refer please?
> 

Likewise I’d be interested in asking Jonathan Rosenne for an example or two of 
automated translation from English to bidi languages with data embedded, 
as on Mon, 11 Jun 2018 15:42:38 +, Jonathan Rosenne via Unicode wrote:
[…]
> > > One has to see it to believe what happens to messages translated 
> > > mechanically from English to bidi languages when data is embedded in the 
> > > text. 

But both would require launching a new thread. 

Thinking hard enough, I’m even afraid that most subscribers wouldn’t be 
interested, so we’d have to move off-list. 

One alternative I can think of is to use one of the CLDR mailing lists. I 
subscribed to CLDR-users when I was directed to move there some technical 
discussion 
about keyboard layouts from Unicode Public.

But now as international message components are not yet a part of CLDR, we’d 
need to ask for extra permission to do so.

An additional drawback of launching a technical discussion right now is that 
significant parts of CLDR data are not yet correctly localized so there is 
another
bunch of priorities under July 11 deadline. I guess that vendors wouldn’t be 
glad to see us gathering data for new structures while level=Modern isn’t 
complete.

In the meantime, you are welcome to contribute and to motivate missing people 
to do the same.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-11 Thread Marcel Schneider via Unicode
On Mon, 11 Jun 2018 16:32:45 +0100 (BST), William_J_G Overington via Unicode 
wrote:
[…]
> Asmus Freytag wrote:
> 
> > If you tried to standardize all error messages even in one language you 
> > would never arrive at something that would be universally useful.
> 
> Well that is a big "If". One cannot standardize all pictures as emoji, but 
> emoji still get encoded, some every year now.
> 
> I first learned to program back in the 1960s using the Algol 60 language on 
> an Elliott 803 mainframe computer, five track paper tape,
> teleprinters to prepare a program on white tape, results out on coloured 
> tape, colours changed when the rolls changed. If I remember
> correctly, error messages, either at compile time or at run time came out as 
> messages of a line number and an error number for compile
> time errors and a number for a run time error. One then looked up the number 
> in the manual or on the enlarged version of the numbers
> and the corresponding error messages that was mounted on the wall.
> 
> > While some simple applications may find that all their needs for 
> > communicating with their users are covered, most would wish they had
> > some other messages available.
> 
> Yes, but more messages could be added to the list much more often than emoji 
> are added to The Unicode Standard, maybe every month
> or every fortnight or every week if needed.
> 
> > To adopt your scheme, they would need to have a bifurcated approach, where 
> > some messages follow the standard, while others do not (cannot).
> 
> Not necessarily. A developer would just need to send in a request to Unicode 
> Inc. to add the needed extra sentences to the list and get a code number.
> 
> > It's pushing this kind of impractical scheme that gives standardizers a bad 
> > name.
> 
> It is not an impractical scheme.

I don’t fully disagree with Asmus, as I suggested making available localizable 
(and effectively localized) libraries of message components, rather than 
entire messages. The challenge as I see it is to get them translated into all 
locales. For this I’m hoping that the advantage of improving user support 
upstream, instead of spending more time on support fora, will be obvious.

By contrast, I do disagree with the idea that industrial standards (as opposed 
to governmental procurement) are a safeguard against impractical schemes.
Devising impractical specifications for industrial procurement hasn’t even been 
a privilege of the French NB (referring to the examples in my e-mail:
https://unicode.org/mail-arch/unicode-ml/y2018-m06/0082.html
), as demonstrated by the example of the hyphen conundrum, where Unicode 
pushes the use of keyboard layouts featuring two distinct hyphens with the 
same general category and the same behavior, but different glyphs in some fonts 
whose designers didn’t think further than the original point of overly 
disambiguating hyphen semantics—while getting around similar traps with other 
punctuation.

And in this thread I wanted to demonstrate that by focusing on the wrong 
priorities, i.e. legacy character names instead of the practicability of 
on-going 
encoding and the accurateness of specified decompositions—so that in some 
instances cedilla was used instead of comma below, Michael pointed out—, 
ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission—and thus 
didn’t inspire a desire of extensive cooperation (and damaged the reputation 
of the whole ISO/IEC).

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-11 Thread Marcel Schneider via Unicode
> > From the outset, Unicode and the US national body tried repeatedly to 
> > engage with SC35 and SC35/WG5,
[…]
> As a reminder: The actual SC35 is in total disconnect from the same SC35 as 
> it was from the mid-eighties to mid-nineties and beyond.

Edit: ISO/IEC JTC1 SC35 was founded in 1999. (In the mentioned timespan, there 
was SC18/WG9.)

> > informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn’t 
> > appear to be interested
> [, or appeared to be interested in ]
> > a pet project and not in what is actually being used in industry.

It seems it isn’t even a pet project; today it’s nothing but a deplorable 
mismanagement mess. In my opinion, at 
some point the inadvertent French NB will apologize to the US National Body and 
to the Unicode Consortium.

As of now, I apologize for my part.

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sun, 10 Jun 2018 15:11:48 +, Peter Constable via Unicode wrote:
> 
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to cooperate, despite of
> > repeated proposals for a merger of both instances.
> 
> First, ISO/IEC 15897 is built on a data-format specification, ISO/IEC TR 
> 14652, that never achieved the support
> needed to become an international standard, and has since been withdrawn. 
> (TRs cannot remain TRs forever.)
> Now, JTC1/SC35 began work four or five years ago to create data-format 
> specification for this, Approved Work Item 30112.
> From the outset, Unicode and the US national body tried repeatedly to engage 
> with SC35 and SC35/WG5,

The involvement in this decade of ISO/IEC JTC1 SC35 WG5 adds a scary level of 
complexity unrelated to the core issues. 
Andrew West already hinted that the stuff was moved from SC22 to SC35, but it 
took me some extra investigation to get the point.
As a reminder: The actual SC35 is in total disconnect from the same SC35 as it 
was from the mid-eighties to mid-nineties and beyond.

> informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn’t 
> appear to be interested
[, or appeared to be interested in ]
> a pet project and not in what is actually being used in industry.

Sorry, I experienced some difficulty understanding, and filled in what I think 
could have been elided.

> After several failed attempts, Unicode and the USNB gave up trying.

Thank you for bringing up this key information.

> 
> So, any suggestion that Unicode has failed to cooperate or is is dropping the 
> ball with regard to locale data and ISO
> is simply uninformed.

That is exact. 

So I think this thread has now led to a main response, and all concerned people 
on this List are welcome 
to take note of these new facts showing that Unicode is totally innocent in 
the ISO/IEC locale data issues.

If that doesn’t suffice to convince the missing people to cooperate in reviewing 
French data in CLDR, 
they may be pleased to know that I keep trying to help us do our best.

Thank you everyone.

Best regards,

Marcel

> 
> 
> Peter
> 
> 
> From: Unicode  On Behalf Of Mark Davis ?? via Unicode
> Sent: Thursday, June 7, 2018 6:20 AM
> To: Marcel Schneider 
> Cc: UnicodeMailing 
> Subject: Re: The Unicode Standard and ISO
> 
> A few facts.
> 
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above 
statement is inaccurate.
> 
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to cooperate, despite of
> repeated proposals for a merger of both instances.
> 
> I recall no serious proposals for that.
> 
> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
considerable costs of maintaining synchrony. Completely inadequate structure 
for modern system requirement, no particular industry support, and scant 
content: see Wikipedia for "The registry has not been updated since December 
2001".)
> 
> Mark
> 
[…]



Re: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 21:21:40 -0700, Steven R. Loomis via Unicode wrote:
> 
> Marcel,
> The idea is not necessarily without merit. However, CLDR does not usually 
>expand scope just because of a suggestion.
>
> I usually recommend creating a new project first - gathering data, looking at 
> and talking to projects to ascertain the usefulness
> of common messages.. one of the barriers to adding new content for CLDR is 
> not just the design, but collecting initial data.
> When emoji or sub-territory names were added, many languages were included 
> before it was added to CLDR.

We know it took years to collect the subterritory names and make sure the list 
and translations are complete.

>
> Also note CLDR does have some typographical terms for use in UI, such as 
> 'bold' and 'italic'

I figure these are intended for tooltips on basic formatting 
facilities. High-end software like Microsoft Office has many more, and adds 
tooltips showing instructions for use, out of a corporate strategy that aims at 
raising usability and overall quality. So I wonder whether there are 
limits for software vendors in cooperating with competitors to pool UI 
content. 

This point and others would be cleared up in the preliminary stage that you 
outlined above but that I don’t feel in a position to carry out, at least 
not now, as I’m focusing on our national data in CLDR and on keyboard layouts 
and standards.

Anyhow, thank you for letting us know.

Best regards,

Marcel


> Regards,
> Steven
>
On Sat, Jun 9, 2018 at 3:41 PM Marcel Schneider via Unicode  wrote:
>
> On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> > 
> > On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > > Still a computer should be understandable off-line, so CLDR providing a 
> > > standard library of error messages could be 
> > > appreciated by the industry
> The kind of translations that CLDR accumulates, like day, and month names, 
> language and territory names, are a widely
> > applicable subset and one that is commonly required in machine generated or 
> > machine-assembled text (like displaying
> > the date, providing pick lists for configuration of locale settings, etc).
> > The universe of possible error messages is a completely different beast.
> > If you tried to standardize all error messages even in one language you 
> > would never arrive at something that would be
> > universally useful. While some simple applications may find that all their 
> > needs for communicating with their users are
> > covered, most would wish they had some other messages available.
>

>
…
> 
> > However, a high-quality terminology database recommends itself (and doesn't 
> > need any procurement standards).
> > Ultimately, it was its demonstrated usefulness that drove the adoption of 
> > CLDR.
> 
> This is why I’m so hopeful that CLDR will go much farther than date and time 
> and other locale settings, and emoji names and keywords.
>

>






Re: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
[…]
> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name. 
> 
> Especially if it is immediately tied to governmental procurement, forcing 
> people to adopt it (or live with it)
> whether it provides any actual benefit.

Or not. What I left untold is that governmental action does effectively work in 
both directions (examples follow), 
but governments do not hold that ambivalence alone, nor exercise it out of 
unbalanced discretion. When the French NB positioned itself 
against encoding Œœ in ISO/IEC 8859-1:1986, it wasn’t the government but a 
manufacturer who wanted to get 
around adding support for this letter in printers. It’s not fully clear to me 
why the same happened to Dutch IJij. 
Anyway, as a result we had (and, legacy doing the rest, still have) two digitally 
malfunctioning languages.
Thanks to the work of Hugh McGregor Ross, Peter Fenwick, Bernard Marti and Loek 
Zeckendorf (ISO/IEC 6937:1983), 
and from 1987 on thanks to the work of Joe Becker, Lee Collins and Mark Davis 
from Apple and Xerox, things started 
working fine, and keep working better and better thanks to Mark Davis’s ongoing 
commitment.

Industrial and governmental action are both ambivalent by nature, simply because 
human action may happen to be 
short-sighted or far-sighted for a variety of reasons. When the French NB issued 
a QWERTY keyboard standard in 1973
and revised it in 1976, there were short-sighted industrial interests rather 
than governmental procurement. End-users 
never adopted it, there was no market, and it has recently been withdrawn. When 
governmental action, hard scientific 
work, human genius and a nascent industrialization brought into existence 
a working keyboard for French that is 
usefully transposable to many other locales as well, it was enthusiastically 
adopted by the end-users and everybody 
urged the NB to standardize it. But the industry first asked for an 
international keyboard standard as a precondition… 
(which ended up being an excellent idea as well). The rest of the story may be 
spared, as the conclusion is already clear.

There is one impractical scheme that bothers me, and that is that we have two hyphens, 
because the ASCII hyphen was duplicated as U+2010. Now since font designers 
(e.g. Lucida Sans Unicode) took the hyphen conundrum seriously to avoid spoofing, 
or for whatever reason, we’re supposed to have keyboard layouts with two hyphens, 
both being Gc=Pd. That is where the related ISO WG2 could have been useful by taking 
a position against U+2010, because disambiguating the minus sign U+2212 and keeping 
the hyphen-minus U+002D in use, like e.g. the period, would have been sufficient.
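
For illustration, here is a quick check of the general categories involved, using 
Python’s standard unicodedata module (just an illustration, nothing normative):

    import unicodedata

    # U+002D HYPHEN-MINUS, U+2010 HYPHEN, U+2212 MINUS SIGN
    for ch in ("\u002D", "\u2010", "\u2212"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<12}  Gc={unicodedata.category(ch)}")

    # U+002D  HYPHEN-MINUS  Gc=Pd
    # U+2010  HYPHEN        Gc=Pd
    # U+2212  MINUS SIGN    Gc=Sm

Both hyphens share Gc=Pd, which is exactly why a layout exposing both ends up with 
two visually near-identical dash keys.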

On the other hand, it is entirely Unicode’s merit that we have two curly apostrophes, 
one that doesn’t break hashtags (U+02BC, Gc=Lm), and one that does (U+2019, Gc=Pf), 
as has been shared on this List (thanks to André Schappo). 
But despite a language being in a position to make distinct use of each of them, 
depending on whether the apostrophe helps denote a particular sound or marks an 
elision (and despite already having a physical keyboard and driver that would make 
distinct entry very easy and straightforward), submitting feedback has not managed 
to raise concern so far. This is an example of how the industry and the governments 
united in the Unicode Consortium are saving end-users lots of trouble.
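
As a rough sketch of why the two behave differently in hashtags (a toy matcher, not 
any platform’s actual rule), the difference follows directly from the general categories:

    import re
    import unicodedata

    print(unicodedata.category("\u02BC"))   # 'Lm': MODIFIER LETTER APOSTROPHE is a letter
    print(unicodedata.category("\u2019"))   # 'Pf': RIGHT SINGLE QUOTATION MARK is punctuation

    # Toy hashtag matcher: \w follows the Unicode letter categories (Lm included, Pf excluded).
    hashtag = re.compile(r"#\w+")
    print(hashtag.findall("#Hawai\u02BCi"))  # ['#Hawaiʼi']  U+02BC stays inside the tag
    print(hashtag.findall("#Hawai\u2019i"))  # ['#Hawai']    U+2019 breaks the tag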

Thank you.

Marcel



Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> 
> On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > Still a computer should be understandable off-line, so CLDR providing a 
> > standard library of error messages could be 
> > appreciated by the industry.
>
> The kind of translations that CLDR accumulates, like day, and month names, 
> language and territory names, are a widely
> applicable subset and one that is commonly required in machine generated or 
> machine-assembled text (like displaying
> the date, providing pick lists for configuration of locale settings, etc).
> The universe of possible error messages is a completely different beast.
> If you tried to standardize all error messages even in one language you would 
> never arrive at something that would be
> universally useful. While some simple applications may find that all their 
> needs for communicating with their users are
> covered, most would wish they had some other messages available.

Indeed, error messages, although technical, are like the world’s books: a never-ending 
production of content. To account for this infinity, I was not proposing a closed set 
of messages to replace application libraries able to display message #123.
In fact I wrote first: “If to date, automatic [automated] translation of technical 
English still does not work, then I’d suggest that CLDR feature a complete message 
library allowing to compose any localized piece of information.”
Here the piece of information displayed by the application is like a Lego spacecraft, 
and the CLDR messages are like Lego bricks.
I haven’t played with Lego for a very long time, but as a boy I learned how it works. 
I even remember that when building a construct, it often happened that some bricks 
were “missing”. A Lego box is complete with respect to one or several models, but 
my mom, showing me the boxes on the shelves, once explained that they’re composed 
in a way that you’ll always lack something [when trying to build further]. — That 
doesn’t prevent Lego from thriving, nor many people from enjoying it.
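
To make the metaphor concrete, here is a toy sketch of such message bricks in Python 
(the pattern strings and the API are invented for this example; CLDR offers nothing 
of the sort today):

    # Hypothetical "message bricks": small pre-translated patterns with placeholders,
    # in the spirit of CLDR date or list patterns. All strings below are invented.
    BRICKS = {
        "en": {"not-found": "The file {name} was not found.",
               "retry":     "Please check the path and try again."},
        "fr": {"not-found": "Le fichier {name} est introuvable.",
               "retry":     "Veuillez vérifier le chemin et réessayer."},
    }

    def compose(locale, brick_ids, **params):
        """Assemble a localized message from pre-translated bricks."""
        return " ".join(BRICKS[locale][b].format(**params) for b in brick_ids)

    print(compose("fr", ["not-found", "retry"], name="rapport.txt"))
    # Le fichier rapport.txt est introuvable. Veuillez vérifier le chemin et réessayer.

An application would then ship only the brick identifiers and the parameters, and the 
locale library would do the assembling.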

> To adopt your scheme, they would need to have a bifurcated approach, where 
> some messages follow the standard,
> while others do not (cannot). At that point, why bother? Determining whether 
> some message can be rewritten to follow
> the standard adds another level of complexity while you'd need to have 
> translation resources for all the non-standard ones anyway.

Once CLDR libraries allow generating 98 % well-translated info boxes, human translators 
may focus on the remaining 2 %. Even if for some reason they cannot, the vendor will 
still get far fewer support requests than with the current ill-translated messages.
 
> A middle ground is a shared terminology database that allows translators 
> working on different products to arrive at the same translation
> for the same things. Translators already know how to use such databases in 
> their work flow, and integrating a shared one with
> a product-specific one is much easier than trying to deal with a set of 
> random error messages.

If the scheme you outline works well, where do the reported oddities come from? 
Obviously terminology is not everything; it’s like Lego bricks without studs: 
terms alone don’t interlock, and therefore the user cannot make sense of them. This is 
where CLDR’s hopefully forthcoming localizable message bricks come into action, 
helping automated translation software compose understandable output, using patterns. 
Google Translate is unable to do that, as shown in the English and French translations 
of this sentence found on a page of the Finnish NB:
https://www.sfs.fi/ajankohtaista/uutiset/nappaimistoon_tarjolla_lisayksia.4249.news

Finnish: Kielitoimiston ohjeen mukaan esimerkiksi vieraskielisissä nimissä on 
pyrittävä säilyttämään kaikki tarkkeet.
Google English: According to the Language Office, for example, in the name of a 
foreign language, it is necessary to maintain all the checkpoints.
Google French: Selon le Language Office, par exemple, au nom d'une langue 
étrangère, il est nécessaire de maintenir tous les points de contrôle.

> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name. 
> 
> Especially if it is immediately tied to governmental procurement, forcing 
> people to adopt it (or live with it) whether it provides any actual benefit.

These statements make much sense to me…

> However, a high-quality terminology database recommends itself (and doesn't 
> need any procurement standards).
> Ultimately, it was its demonstrated usefulness that drove the adoption of 
> CLDR.

This is why I’m so hopeful that CLDR will go much farther than date and time 
and other locale settings, and emoji names and keywords.

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On the other hand, most end-users don’t appreciate getting “a screenful of 
all-in-English” when “something happened.”
If even big companies still haven’t succeeded in getting automated computer 
translation to work for error messages, then best practice could eventually be to 
provide an internet link with every message. Given that web pages are generally 
less sibylline than error messages, they may translate better, and Philippe Verdy’s 
hint is therefore a working solution for localized software end-user support.

Still a computer should be understandable off-line, so CLDR providing a 
standard library of error messages could be 
appreciated by the industry.

Best regards,

Marcel 

On Sat, 9 Jun 2018 18:14:17 +, Jonathan Rosenne via Unicode wrote:
> 
> Translated error messages are a horror story. Often I have to play around 
> with my locale settings to avoid them.
> Using computer translation on programming error messages is no way near to 
> being useful.
> 
> Best Regards,
> 
> Jonathan Rosenne
> 
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe 
> Verdy via Unicode
> Sent: Saturday, June 09, 2018 7:49 PM
> To: Marcel Schneider
> Cc: UnicodeMailingList
> Subject: Re: The Unicode Standard and ISO
 

 

2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> > 
> > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> > 
> > > > Where there is opportunity for productive sync and merging with is
> > > > glibc. We have had some discussions, but more needs to be done-
> > > > especially a lot of tooling work. Currently many bug reports are
> > > > duplicated between glibc and cldr, a sort of manual
> > > > synchronization. Help wanted here.  
> > > 
> > > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > > help.
> > 
> > I wonder how much of that comes under the sad category of "better not
> > translated". If an English speaker has to resort to search engines to
> > understand, let alone fix, a reported problem, it may be better for a
> > non-English speaker to search for the error message in English, and then
> > with luck he may find a solution he can understand.
> 
> Then adding a "Display in English" button in the message box is best practice.
> Still I’ve never encountered any yet, and I guess this is because such a 
> facility 
> would be understood as an admission that up to now, i18n is partly a failure.

 


- Navigate any page on the web in another language than yours, with a Google 
Translate plugin enabled on your browser. You’ll have the choice of seeing 
the automatic translation or the original.


 


- Many websites that have pages proposed in multiple languages offer such 
buttons to select the language you want to see (and not necessarily falling 
back to English, because the original may as well be in another language and 
English is an approximate translation, notably for sites in Asia, Africa and 
South America).


 


- Even the official websites of the European Union (or EEA) offer such a choice 
(but at least the available translations are correctly reviewed for European 
languages; not all pages are translated into all official languages of member 
countries, but this is the case for most pages intended to be read by the 
general public, while pages about ongoing work, technical reports for 
specialists, or recent legal decisions may not be translated except into a few 
"working languages", generally English, German, and French, sometimes Italian, 
the 4 languages spoken officially in multiple countries in the EEA 
including at least one in the European Union).


 


So it's not a "failure" but a feature to be able to select the language, and to 
know when a proposed translation is fully or partly automated.








Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> 
> On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
> 
> > > Where there is opportunity for productive sync and merging with is
> > > glibc. We have had some discussions, but more needs to be done-
> > > especially a lot of tooling work. Currently many bug reports are
> > > duplicated between glibc and cldr, a sort of manual
> > > synchronization. Help wanted here.  
> > 
> > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > help.
> 
> I wonder how much of that comes under the sad category of "better not
> translated". If an English speaker has to resort to search engines to
> understand, let alone fix, a reported problem, it may be better for a
> non-English speaker to search for the error message in English, and then
> with luck he may find a solution he can understand.

Then adding a "Display in English" button in the message box is best practice.
Still I’ve never encountered any yet, and I guess this is because such a 
facility 
would be understood as an admission that up to now, i18n is partly a failure.

> In a related vein,
> one hears reports of people using English as the interface language,
> because they can't understand the messages allegedly in their native
> language.

If, to date, automatic translation of technical English still does not work, then I’d 
suggest that CLDR feature a complete message library allowing any localized piece of 
information to be composed. But such an attempt requires that all available human 
resources really focus on the project, instead of being diverted by interpersonal 
discordances. Sulking people around a project are an indicator of poor project 
management that brands dissenters as enemies, out of an inability to behave 
diplomatically for lack of social skills.
At least that’s what they’d teach you in any management school.

The way Unicode behaves toward William Overington is in my opinion a striking example 
of mismanagement. In the one dimension I can see, the "localizable sentences" that 
William invented and actively promotes fit exactly into the scheme of localizable 
information elements suggested in the preceding paragraph. I strongly recommend that, 
instead of publicly blacklisting the author in the mailbox of the president and directing 
the List moderation to prohibit the topic as out of scope of Unicode, an extensible and 
flexible framework be designed urgently under the Unicode‐CLDR umbrella to put an end 
to the pseudo‐localization that Richard pointed out above.

OK I’m lacking diplomatic skills too, and this e‐mail is harsh, but I see it as 
a true echo.
And I apologize for my last reply to William Overington, if I need to.
http://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0118.html

Beside that, I’d also suggest adding a CLDR library of character name elements allowing 
every existing Unicode character name to be composed in all supported locales, for use 
in system character pickers and special character dialogs. This library would then be 
updated at each major release of the UCS. Hopefully such a library is flexible enough 
to avoid any Standardese, be it in English, in French, or in any language aping English 
Standardese.
E.g. when the ISO/IEC 10646 mirror of Unicode was published in an official French 
version, the official translators felt partly committed to aping English Standardese, 
which we know is due not mainly to Unicode, but to the then‐head of ISO/IEC JTC1 SC2 
WG2. Not to warm up that old grudge, just to show how on‐topic this is. Be it 
Standardese or pseudo‐localization, the effect is always to worsen UX by missing the point.
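
To give an idea of what such a library could look like, here is a deliberately naive 
sketch (the French fragments and the composition rule are invented for the example; 
a real library would need per-locale ordering and agreement rules):

    # Hypothetical name-element dictionary; the French strings are illustrative only.
    NAME_ELEMENTS_FR = {
        "LATIN": "LATINE",
        "SMALL": "MINUSCULE",
        "CAPITAL": "MAJUSCULE",
        "LETTER": "LETTRE",
        "WITH": "",              # dropped in French phrasing
        "ACUTE": "ACCENT AIGU",
    }

    def localize_name(english_name, elements):
        # Translate fragment by fragment; real locale rules would also reorder them.
        parts = [elements.get(word, word) for word in english_name.split()]
        return " ".join(p for p in parts if p)

    print(localize_name("LATIN SMALL LETTER A WITH ACUTE", NAME_ELEMENTS_FR))
    # LATINE MINUSCULE LETTRE A ACCENT AIGU  (ordering still to be handled per locale)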

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 09:20:09 -0700, Steven R. Loomis via Unicode wrote:
[…]
> But, it sounds like the CLDR process was successful in this case. Thank you 
>for contributing.
 
You are welcome, but thanks are due to the actual corporate contributors.

[…]
> Actually, I think the particular data item you found is relatively new. The 
> first values entered
> for it in any language were May 18th of this year.  Were there votes for 
> "keycap" earlier?

The "keycap" category is found as soon as in v30 (released 2016-10-05).

> Rather than a tracer finding evidence of neglect, you are at the forefront of 
> progressing the translated data for French. Congratulations!

The neglect is on my part, as I neglected to check the data history. 
Please note that I did not make accusations of neglect. Again: the historic 
Code Charts translators, partly still active, are shunning CLDR because Unicode 
is perceived as shunning ISO/IEC 15897, so that minimal staff is actively 
translating CLDR for the French locale and can legitimately feel forsaken. I even 
made detailed suppositions as to how it could happen that "keycap" remained untranslated.
 
[…] [Unanswered questions (please refer to my other e‐mails in this thread)]

> The registry for ISO/IEC 15897 has neither data for French, nor structure 
> that would translate the term "Characters | Category | Label | keycap". 
> So there would be nothing to merge with there.

Correct. The only data for French is an ISO/IEC 646 charset:
http://std.dkuug.dk/cultreg/registrations/number/156
As far as I can see, there are data available to merge for Danish, Faroese, 
Finnish, Greenlandic, Norwegian, and Swedish.

> So, historically, CLDR began not a part of Unicode, but as part of Li18nx 
> under the Free Standards Group. See the bottom of the page 
> http://cldr.unicode.org/index/acknowledgments
> "The founding members of the workgroup were IBM, Sun and OpenOffice.org". 
> What we were trying to do was to provide internationalized content for Linux, 
> and also, to resolve the then-disparity between locale data
> across platforms. Locale data was very divergent between platforms - spelling 
> and word choice changes, etc.  Comparisons were done
> and a Common locale data repository  (with its attendant XML formats) 
> emerged. That's the C in CLDR. Seed data came from IBM’s ICIR
> which dates many decades before 15897 (example 
> http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/
> - 4th edition published in 1994.) 100 locales we contributed to glibc as well.

Thank you for the account and resources. The Linux Internationalization 
Initiative appears to have issued its last release on August 23, 2000:
https://www.redhat.com/en/about/press-releases/83
the year before ISO/IEC 15897 was lastly updated:
http://std.dkuug.dk/cultreg/registrations/chreg.htm

> Where there is opportunity for productive sync and merging with is glibc. We 
> have had some discussions, but more needs to be
> done- especially a lot of tooling work. Currently many bug reports are 
> duplicated between glibc and cldr, a sort of manual synchronization.
> Help wanted here. 

Noted. For my part, sadly for C libraries I’m unlikely to be of any help.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 16:54:20 -0400, Tom Gewecke via Unicode wrote:
> 
> > On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode  wrote:
> > 
> > People relevant to projects for French locale do trace the borderline of 
> > applicability wider 
> > than do those people who are closerly tied to Unicode‐related projects.
> 
> Could you give a concrete example or two of what these people mean by “wider 
> borderline of applicability”
> that might generate their ethical dilemma?
> 

Drawing the borderline up to which ISO/IEC should be among the involved parties, as I 
put it, is about the Unicode policy as to how ISO/IEC JTC1 SC2 WG2 is involved in the 
process, how it appears in public (FAQs, Mailing List responding practice, and so on), 
and how people in that WG2 feel with respect to Unicode. That may differ depending on 
the standard concerned (ISO/IEC 10646, ISO/IEC 14651), so that the former is put in the 
first place as vital to Unicode, while the latter is almost entirely hidden (except in 
appendix B of UTS #10).

Then when it comes to locale data, Unicode people see the borderline below, while ISO 
people tend to see it above. This is why Unicode people do not want the 
twin‐standards‐bodies principle applied to locale data, and are ignoring or declining 
any attempt to equalize the situations, arguing that ISO/IEC 15897 is useless. As I’ve 
pointed out in my previous e‐mail responding to Asmus Freytag, ISO/IEC 10646 was about 
as useless until Unicode stepped in and merged itself with that UCS embryo (not to say 
miscarriage in the making). The only things WG2 could insist upon were names and huge 
bunches of precomposed or preformatted characters that Unicode was designed to support 
in plain text by other means. The essential part was Unicode’s, and without Unicode we 
wouldn’t have any usable UCS. ISO/IEC 15897 appears to be in a similar position: not 
very useful, not very performant, not very complete. 
But an ISO/IEC standard. Logically, Unicode should feel committed to merging with it 
the same way it did with the other standard, maintaining the data, and publishing 
periodical abstracts under ISO coverage. There is no problem in publishing a framework 
standard under the ISO/IEC umbrella, associated with a regular up‐to‐date snapshot of 
the data.

That is what I mean when I say that Unicode arbitrarily draws borderlines of its own, 
regardless of how people at ISO feel about them.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> 
[…]
> There's no value added in creating "mirrors" of something that is 
> successfully being developed and maintained under a different umbrella.

Wouldn’t the same be true for ISO/IEC 10646? It has no added value either, and WG2 
meetings could be merged with UTC meetings.
Unicode maintains the entire chain, from the roadmap to the production tool 
(which the Consortium ordered without paying a full license).

But the case is about part of the people who are eager to maintain an alternate forum, 
whereas the industry (i.e. the main users of the data) is interested in fast‐tracking 
character batches, and thus tends to shortcut ISO/IEC JTC1 SC2 WG2. This is proof 
enough that, applying the same logic as to ISO/IEC 15897, WG2 would be eliminated. The 
reason why it was not is that Unicode was weaker and needed support from ISO/IEC to 
gain enough traction, despite the then‐ISO/IEC 10646 being useless in practice, as it 
pursued an unrealistic encoding scheme.
To overcome this, somebody in ISO started actively campaigning for the Unicode 
encoding model, encountering fierce resistance from fellow 
ISO people until he succeeded in teaching them real‐life computing. He had 
already invented and standardized the sorting method later used 
to create UCA and ISO/IEC 14651. I don’t believe that today everybody forgot 
about him.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 08:50:28 -0400, Tom Gewecke via Unicode wrote:
> 
> 
> > On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode  wrote:
> > 
> > What bothered me ... is that the registration of the French locale in CLDR 
> > is 
> > still surprisingly incomplete
> 
> Could you provide an example or two?
> 

What got me started is that "Characters | Category | Label | keycap" remained 
untranslated, i.e. its French translation was "keycap". 

A number of keyword translations are missing or wrong. I can tell that all 
actual contributors are working hard to fix the issues.
I can imagine that it’s for lack of time in front of the huge mass of data, or 
from feeling so alone (only three corporate contributors, 
no liaison or NGOs). No wonder if the official French translators are all 
shunning the job (reportedly, not my own inference).

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
nts to give me $1M. Since I don't 
> think it is worth my time,
> or am not willing to upfront the low, low fee of $10K, I might "ignore" the 
> email, or "not respond" to it.
> Or I might "decline" it with a no-thanks or not-interested response. But none 
> of that is to "refuse" it. 

Thanks, I got it (the point, and the e‐mail).

More seriously, to ignore or not respond to, or even to decline, a suggestion made by 
a well‐known high official is in my opinion tantamount to refusing that proposition. 
Beyond that, I think I’d be unable to carve out any common denominator with an 
unsolicited bulk e‐mail.

Marcel

>
> On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode  wrote:
>
> > On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> > 
> > > I cannot but fully agree with Mark and Michael.
> > >
> > > Sincerely
> > 
> >
> >Thank you for confirming. All witnesses concur to invalidate the statement 
> >about 
> >uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in 
> >its 
> >actual form, sorting was standardized simultaneously in ISO/IEC 14651 and in 
> >Unicode Collation Algorithm, the latter including practice‐oriented extra 
> >features. 
> >Since then, these two standards are kept in synchrony uninterruptedly.
> >
> >Getting people to correct the overall response was not really my initial 
> >concern, 
> >however. What bothered me before I learned that Unicode refuses to cooperate 
> >with ISO/IEC JTC1 SC22 is that the registration of the French locale in CLDR 
> >is 
> >still surprisingly incomplete despite the meritorious efforts made by the 
> >actual 
> >contributors, and then after some investigation, that the main part of the 
> >potential 
> >French contributors are prevented from cooperating because Unicode refuses 
> >to 
> >cooperate with ISO/IEC on locale data while ISO/IEC 15897 predates CLDR, 
> >reportedly after many attempts made to merge both standards, remaining
> >unsuccessful without any striking exposure or friendly agreement to avoid 
> >kind of 
> >an impression of unconcerned rebuff.
> > 
> >Best regards,
> >
> >Marcel
> 
>



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 00:43:04 +0200, Philippe Verdy via Unicode wrote:
[cited mail]
>
> The "normative names" are in fact normative only as a forward reference
> to the ISO/IEC repertoire becaus it insists that these names are essential 
> part
> of the stable encoding policy which was then integrated in the Unicode 
> stability rules,
> so that the normative reference remains stable as well). Beside this, Unicode 
> has other
> more useful properties. People don't care at all about these names.

Effectively we have learned to live even with those that are uselessly misleading and 
had been pushed through against better proposals made on the Unicode side, particularly 
the wrong left/right attributes. Unicode has worked hard to palliate these misnomers by 
introducing the Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type (Open, Close, None) 
properties, and by specifying in TUS that, apart from a few exceptions, LEFT and RIGHT 
in the names of paired punctuation are to be read as OPENING and CLOSING, respectively.
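
For implementers, these properties are machine-readable in the UCD data file 
BidiBrackets.txt; here is a minimal reading sketch (assuming the usual UCD line 
format "0028; 0029; o # LEFT PARENTHESIS"):

    # Minimal sketch: load Bidi_Paired_Bracket / Bidi_Paired_Bracket_Type
    # from the UCD file BidiBrackets.txt (the path is up to the caller).
    def load_paired_brackets(path="BidiBrackets.txt"):
        pairs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()    # drop comments and blanks
                if not line:
                    continue
                cp, paired, kind = (field.strip() for field in line.split(";"))
                pairs[chr(int(cp, 16))] = (chr(int(paired, 16)), kind)  # kind: 'o' or 'c'
        return pairs

    # load_paired_brackets()["("]  ->  (")", "o")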

> The character properties and the related algorithms that use them (and even
> the representative glyph even if it's not stabilized) are much more important
> (and the ISO/IEC 101646 does not do anything to solve the real encoding 
> issues,
> and needed properties for correct processing). Unicode is more based on 
> commonly
> used practices and allows experimetnation and progressive enhancing without 
> having
> to break the agreed ISO/EIC normative properties. The position of Unicode is 
> more
> pragmatic, and is much more open to lot of contibutors than the small ISO/IEC 
> subcomities
> with in fact very few active members, but it's still an interesting 
> counter-power that allows
> governments to choose where it is more useful to contribute and have 
> influence when
> the industry may have different needs and practices not foàllowing the 
> government
> recommendations adopted at ISO.

Now it becomes clear to me that this opportunity for governmental action is exactly 
what could be useful when it comes to fixing the textual appearance of national user 
interfaces, and that is exactly why not federating communities around CLDR, and not 
attempting to get efforts to converge, is so counter‐productive.

Thanks for getting this point out.

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> 
> I cannot but fully agree with Mark and Michael.
> 
> Sincerely
> 

Thank you for confirming. All witnesses concur in invalidating the statement about 
the uniqueness of the ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in its 
present form, sorting was standardized simultaneously in ISO/IEC 14651 and in the 
Unicode Collation Algorithm, the latter including practice‐oriented extra features. 
Since then, these two standards have been kept in synchrony uninterruptedly.

Getting people to correct the overall response was not really my initial concern, 
however. What bothered me, before I learned that Unicode refuses to cooperate 
with ISO/IEC JTC1 SC22, is that the registration of the French locale in CLDR is 
still surprisingly incomplete despite the meritorious efforts made by the actual 
contributors; and then, after some investigation, that the main part of the potential 
French contributors are prevented from cooperating because Unicode refuses to 
cooperate with ISO/IEC on locale data while ISO/IEC 15897 predates CLDR, 
reportedly after many attempts to merge both standards, which remained 
unsuccessful without any striking exposure or friendly agreement to avoid giving 
an impression of unconcerned rebuff.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 17 May 2018 22:26:15 +, Peter Constable via Unicode wrote:
[…]
> Hence, from an ISO perspective, ISO 10646 is the only standard for which 
> on-going
> synchronization with Unicode is needed or relevant. 

This point of view is fueled by the Unicode Standard being traditionally thought of as 
a mere character set, regardless of all efforts—most recently by first responder Asmus 
Freytag himself—to widen that conception.

On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>
> It would be great if mutual synchronization were considered to be of benefit.
> Some of us in SC2 are not happy that the Unicode Consortium has published 
> characters
> which are still under Technical ballot. And this did not happen only once. 

I’m not happy catching up on this thread so late, the less so as it ultimately brings 
me back to where I started in 2014/2015: the wrong character names that the ISO/IEC 
10646 merger infiltrated into Unicode.
This is the very thing I did not vent in my first reply. From my point of view, this 
misfortune would be reason enough for Unicode not to seek further cooperation with 
ISO/IEC.

But I remember the many voices raised on this List to tell me that this is all over 
and forgiven.
Therefore I’m confident that the Consortium will have the mindfulness to complete the 
ISO/IEC JTC 1 partnership by publicly assuming synchronization with ISO/IEC 14651, and 
by achieving a full-scale merger with ISO/IEC 15897, after which the valid data would 
stay hosted entirely in CLDR, and ISO/IEC 15897 would be its ISO mirror. 

That is a matter of smart diplomacy, at which Unicode may again prove to be great.

Please consider making this move.

Thanks,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 7 Jun 2018 15:20:29 +0200, Mark Davis ☕️ via Unicode wrote:
> 
> A few facts. 
>
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the
> synchronization level in more detail, but the above statement is inaccurate.
>
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to
> > cooperate, despite of repeated proposals for a merger of both instances.
> 
> I recall no serious proposals for that. 
> 
> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought
> no value to the table. Certainly nothing to outweigh the considerable costs 
> of maintaining synchrony.
> Completely inadequate structure for modern system requirement, no particular 
> industry support, and
> scant content: see Wikipedia for "The registry has not been updated since 
> December 2001".)



Thank you for the correction regarding the synchrony between Unicode and ISO/IEC 14651; 
indeed, while on

http://www.unicode.org/reports/tr10/#Synch_ISO14651

we can read that “This relationship between the two standards is similar to 
that maintained between
the Unicode Standard and ISO/IEC 10646[,]” confusingly there seems to be no 
related FAQ. Even more 
confusingly, a straightforward question like “I was wondering which ISO 
standards other than ISO 10646 
specify the same things as the Unicode Standard” remains ultimately unanswered. 

The reason might be that the “and of those, which ones are actively kept in 
sync” part is really best 
answered by “none.” In fact, while UCA is synched with ISO/IEC 14651, the 
reverse statement is 
reportedly false. Hence, UCA would be what is called an implementation of 
ISO/IEC 14651.

Nevertheless, UAX #10 refers to “The synchronized version of ISO/IEC 14651[,]” 
and mentions a 
“common tool[.]” 

Hence one simple question: why does the fact that the Unicode-ISO synchrony encompasses 
*two* standards remain untold in the first place?


As for ISO/IEC 15897, it would certainly be a piece of good diplomacy if Unicode picked 
the usable data from the existing set; then ISO/IEC 15897 would be in a position to 
cite CLDR as a normative reference, so that all potential contributors are redirected 
and may feel free to contribute to CLDR.

And it would be nice if Unicode didn’t forget to commission an additional FAQ on the 
topic, please.

Thanks,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> 
> On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > Hello,
> >
> > There are several mentions of synchronization with related standards in
> > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
> > never mention anything other than ISO 10646.
> 
> Because that is the standard for which there is an explicit understanding by 
> all involved
> relating to synchronization. There have been occasionally some challenging 
> differences
> in the process and procedures, but generally the synchronization is being 
> maintained,
> something that's helped by the fact that so many people are active in both 
> arenas.

Perhaps the cause-and-effect relationship is somewhat unclear. I think that many people 
being active in both arenas is helped by the fact that there is a strong will to 
maintain synchronization.

If there were similar policies notably for ISO/IEC 14651 (collation) and 
ISO/IEC 15897 
(locale data), ISO/IEC 10646 would be far from standing alone in the field of 
Unicode-ISO/IEC cooperation.

> 
> There are really no other standards where the same is true to the same extent.
> >
> > I was wondering which ISO standards other than ISO 10646 specify the
> > same things as the Unicode Standard, and of those, which ones are
> > actively kept in sync. This would be of importance for standardization
> > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > ISO standards is generally preferred in ISO standards.
> >
> One of the areas the Unicode Standard differs from ISO 10646 is that its 
> conception
> of a character's identity implicitly contains that character's properties - 
> and those are
> standardized as well and alongside of just name and serial number.

This is probably why, to date, ISO/IEC 10646 features character properties by including 
normative references to the Unicode Standard, the Standard Annexes, and the UCD.
Bidi mirroring, e.g., is part of ISO/IEC 10646, which specifies in clause 15.1:

“[…] The list of these characters is determined by having the ‘Bidi_Mirrored’ 
property 
set to ‘Y’ in the Unicode Standard. These values shall be determined according 
to 
the Unicode Standard Bidi Mirrored property (see Clause 2).”
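
As a small illustration, the property can be queried directly with Python’s standard 
unicodedata module:

    import unicodedata

    # Bidi_Mirrored: 1 for characters mirrored in right-to-left context, else 0.
    for ch in "(<[A+":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<22}  Bidi_Mirrored={unicodedata.mirrored(ch)}")
    # '(' '<' '[' report 1; 'A' and '+' report 0.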

> 
> Many of these properties have associated with them algorithms, e.g. the bidi 
> algorithm,
> that are an essential element of data interchange: if you don't know which 
> order in
> the backing store is expected by the recipient to produce a certain display 
> order, you
> cannot correctly prepare your data.
> 
> There is one area where standardization in ISO relates to work in Unicode 
> that I can
> think of, and that is sorting.

Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the bibliography).
The reverse relationship is irrelevant and would be unfair, given that the Consortium
has until now refused to synchronize UCA and ISO/IEC 14651.

Here is a need for action.

> However, sorting, beyond the underlying framework,
> ultimately relates to languages, and language-specific data is now housed in 
> CLDR.
> 
> Early attempts by ISO to standardize a similar framework for locale data 
> failed, in
> part because the framework alone isn't the interesting challenge for a 
> repository,
> instead it is the collection, vetting and management of the data.

For another part, it failed because the Consortium refused to cooperate, despite 
repeated proposals for a merger of both instances.

> 
> The reality is that the ISO model and its organizational structures are not 
> well suited
> to the needs of many important area where some form of standardization is 
> needed.
> That's why we have organization like IETF, W3C, Unicode etc..
> 
> Duplicating all or even part of their effort inside ISO really serves 
> nobody's purpose.

An undesirable side effect of not merging Unicode with ISO/IEC 15897 (locale data) is 
that it diverts many competent contributors from monitoring CLDR data, especially for 
French.

Here too is a huge need for action.

Thanks in advance.

Marcel



Unicode 11.0.0: BidiMirroring.txt

2018-06-07 Thread Marcel Schneider via Unicode
In the wake of the new release, may we discuss the reason why the UTC persisted in 
recommending that 3 pairs of mathematical symbols featuring tildes be mirrored 
in low-end support by glyph-exchange bidi mirroring, with the result that the 
legibility of the tildes is challenged, as demonstrated for “Remedial 11” in:

https://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html

(This was written up for meeting #153, whereas an outdated alias file with roughly 
the same content was discussed again at meeting #154, and there were also some 
related items in the general feedback hopper.)

Thanks,

Marcel



Re: More scripts, not more emoji (Re: Accessibility Emoji)

2018-04-15 Thread Marcel Schneider via Unicode
On Sat, 14 Apr 2018 20:29:40 -0700, Markus Scherer <markus@gmail.com> wrote:
>
> On Sat, Apr 14, 2018 at 5:50 PM, Marcel Schneider via Unicode  wrote:
> >
> > We need to get more scripts into Unicode, not more emoji.
> >
> > That is — somewhat inflated — the core message of a NYT article published 
> > six months ago,
> > and never shared here (no more than so many articles about Unicode, 
> > scripts, and emoji).
> > Some 100 scripts are missing in the Standard, affecting as many as 400 
> > million people worldwide.
> >
> > https://www.nytimes.com/2017/10/18/magazine/how-the-appetite-for-emojis-complicates-the-effort-to-standardize-the-worlds-alphabets.html
>
> You are right. One good way that you can help make it happen is to support 
> the Script Encoding Initiative which is mentioned in the article.
>
> Some of the AAC money goes there. And since the most popular adopted 
> characters are emoji, their popularity is helping close the gap that you
> pointed out.
>
>
> They have also helped in other ways -- they really motivated developers to 
> make their code work for supplementary code points, grapheme cluster 
> boundaries, font ligatures, spurred development of color font technology, and 
> got organizations to update to newer versions of Unicode faster than
> before. Several of these things are especially useful for recently added 
> scripts.

Thank you for the point. 

Indeed, the NYT article, too, is much more balanced than what I bounced to the 
List as an exaggerated takeaway. 

We send our thanks to the sponsors of the Adopt A Character program, to the 
SEI, and to the United States National Endowment for the 
Humanities, which funded the Universal Scripts Project. And last but not least, 
to the Unicode Consortium.

I note, too, that the cited 400 million people write in fewer than fifty as yet 
unsupported – but hopefully soon encoded – scripts.

Best regards,

Marcel



More scripts, not more emoji (Re: Accessibility Emoji)

2018-04-14 Thread Marcel Schneider via Unicode

We need to get more scripts into Unicode, not more emoji.

That is — somewhat inflated — the core message of a NYT article published six months 
ago, and never shared here (no more than so many other articles about Unicode, 
scripts, and emoji were).
Some 100 scripts are missing from the Standard, affecting as many as 400 million 
people worldwide.

https://www.nytimes.com/2017/10/18/magazine/how-the-appetite-for-emojis-complicates-the-effort-to-standardize-the-worlds-alphabets.html

(Just found while searching for Hanifi Rohingya script, thanks to the Wikipedia 
entry 
[trying to find out whether to include Hanifi Rohingya in beta feedback 
{closing soon}]).

On 01/04/18 08:27 Nathan Galt via Unicode wrote
> 
> I predict that these emoji will be extraordinarily popular in insults between 
> gamers on both Twitch and Discord. I’d wager, with suitable metrics 
> available, that using these for insult purposes will be the majority of all 
> accessibility-emoji use worldwide. Expected meanings:
> 
> - PERSON WITH WHITE CANE: “the person under discussion didn’t see that guy 
> who killed him/his partner/his whole team”
> - DEAF SIGN: “the person under discussion failed to notice an audio cue that 
> would have prevented his/his partner’s/his team’s death(s)”
> - PERSON IN MECHANIZED WHEELCHAIR: “the person under discussion failed to 
> properly press keys and move his mouse as he should have and his mechanical 
> failures caused his/his partner’s/his team's death(s)”
> 
> I don’t think the cultural impact of these will be as uniformly positive as 
> Apple hopes.
> 
> 
> > On Mar 26, 2018, at 9:51 AM, William_J_G Overington via Unicode  wrote:
> > 
> > I have been looking with interest at the following publication.
> > 
> > Proposal For New Accessibility Emoji
> > 
> > by Apple Inc.
> > 
> > www.unicode.org/L2/L2018/18080-accessibility-emoji.pdf
> > 
> > I am supportive of the proposal. Indeed please have more such emoji as well.
> > 
> > [snip]
> > 
> > How could the accessibility emoji in the proposal be used in practice?
> > 
> > William Overington
> > 
> > Monday 26 March 2018
> 
> 
>



Re: Accessibility Emoji

2018-03-29 Thread Marcel Schneider via Unicode
William,
 
On 29/03/18 17:03 William_J_G Overington via Unicode wrote:
> 
> I have been thinking about issues around the proposal.
> http://www.unicode.org/L2/L2018/18080-accessibility-emoji.pdf
> There is a sentence in that document that starts as follows.
> 
> > Emoji are a universal language and a powerful tool for communication, 
 
That is clearly overstating the capabilities of emoji, and ignoring the borderline 
between verbal and pictographic expression. The appropriateness of each depends 
mainly on semantics and context. The power of emoji may lie in their being polysemic, 
escaping censorship, as already discussed in past years.
> 
> It seems to me that what is lacking with emoji are verbs and pronouns.
 
Along with these, one would need more nouns, too, setting up an autonomous 
language. That however is not the goal of emoji and is outside the scope of 
Unicode.
> 
> For example, "to be", "to have" and "to need". The verb "to need" might well 
be of particular importance in relation to accessibility considerations.
 
When accessibility matters, devices may be missing, and then the symbol charts 
are most appropriate, as seen. When somebody is pointing at an object, the “need” 
case is most obvious anyway. Impaired persons may use a bundle of cards including 
textual messages. None of these justifies encoding extra emoji. E.g. when somebody 
wishes a relative to buy more bread on the way back from work, the appropriate number
of loaves followed by an exclamation mark and a smile or heart may do it.
> 
> How could verbs be introduced into emoji? The verb "to love" can already be 
> indicated using a heart symbol.
 
This is the one that people are likely to be most embarrassed typing out. 
> 
> Should abstract designs be used? Or should emoji always be pictographic?
 
Yes, they should always be highly iconic, as Asmus explained in detail. See:
 
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0014.html
> 
> If abstract designs were introduced would it be possible for the standards 
> documents to include the meanings
> or would the standards documents need to simply use a geometrical description 
> and then the meanings be
> regarded as a higher level protocol outside of the standard?
 
On one hand, Unicode does not encode semantics; but on the other hand, at the character 
level, semantics are part of the documentation accompanying a number of characters in 
the Charts. There is a balance between polysemy and disambiguation. As a rule of thumb: 
characters are disambiguated to ensure correct processing of the data, so far as the 
cost induced by handling multiple characters doesn’t outweigh the benefit. 
In putting your question, you already answered it, except that there are geometric 
figures encoded for UIs that therefore already have a meaning, yet are mostly 
generically named, leaving the door open to alternate semantics.
> 
> For, if abstract emoji were introduced with the intention of them to be of 
> use as verbs in a universal language,
> it would be of benefit if the meanings were in the standard.
 
But such a language has clearly been stated as being out of scope of Unicode, 
and we aren’t even allowed 
to further discuss that particular topic, given the mass of threads and e‐mails 
already dedicated to it in the past.
> 
> If abstract designs were used then the meanings would need to be learned. Yet 
> if the meanings were
> universal that could be a useful development.
 
It would not, because automatic translation tools already cater for these 
needs, and possibly better. See:
 
http://unicode.org/pipermail/unicode/2015-October/003005.html
> 
> I have wondered whether verb tenses could be usefully expressed using some of 
> the existing combining
> accent characters following an emoji verb character..
 
First of all, users should be likely to adopt the scheme in a fairly predictable way. 
I’m ignoring actual trends and can only repeat what has been said on this list: 
communities are missing, and so is interest. 
Hence, sadly, there is little to no point in elaborating further.
Personally I’m poorly armed to help build a user community, as I don’t have a 
smartphone, while being very busy with more and more tasks, leaving little time for 
many experiments.  Sorry.
 
Best regards,
 
Marcel
> 
> For example, U+0302 COMBINING CIRCUMFLEX ACCENT to indicate that the verb is 
> in the future tense, U+0304 COMBINING MACRON to indicate that the verb is in 
> the present tense, U+030C COMBINING CARON to indicate that the verb is in the 
> past tense, U+0303 COMBINING TILDE to indicate that the verb is in the 
> conditional tense.
> 
> The desirability of pronouns was raised by a gentleman in the audience of a 
> lecture at the Internationalization and Unicode Conference in 2015.
> 
> I tried to produce some designs. I could not find a way to do that with 
> conventional illustrative pictures, though I did produce a set of abstract 
> designs that could 

Re: Translating the standard

2018-03-13 Thread Marcel Schneider via Unicode
On Tue, 13 Mar 2018 16:48:51 -0700, Asmus Freytag (c) via Unicode wrote:

On 3/13/2018 12:55 PM, Philippe Verdy wrote:

It is then a version of the matching standards from Canadian and French 
standard bodies. This does not make a big difference, except that those 
national standards (last editions in 2003) are not kept in sync with evolutions 
of the ISO/IEC standard. So it can be said that this was a version for the 2003 
version of the ISO/IEC standard, supported and sponsored by some of their 
national members.


There is a way to transpose international standards to national standards, but 
they then pick up a new designation, e.g. ANSI for US or DIN for German or EN 
for European Norm.

A./



2018-03-13 19:38 GMT+01:00 Asmus Freytag via Unicode :


On 3/13/2018 11:20 AM, Marcel Schneider via Unicode wrote:

On Mon, 12 Mar 2018 14:55:28 +, Michel Suignard wrote:


Time to correct some facts.
The French version of ISO/IEC 10646 (2003 version) were done in a separate 
effort by Canada and France NBs and not within SC2 proper. 
...


Then it can be referred to as “French version of ISO/IEC 10646” but I’ve got 
Andrew’s point, too.


Correction: if a project is not carried out by SC2 (the proper ISO/IEC 
subcommittee) then it is not a "version" of the ISO/IEC standard.

A./
 





Thanks for the correction. And I confess and apologize that on Patrick’s French 
Unicode 5.0 Code Charts page (http://hapax.qc.ca/Tableaux-5.0.htm), there is no 
instance of "version", although the item is referred to as "ISO 10646:2003 (F)", 
from which it can ordinarily be inferred that "ISO" did back the project and that 
it is considered the French version of the standard.
 
I wasn’t aware that this kind of parsing of the facts is somewhat informal and 
shouldn’t be handled on mailing lists without a caveat.
 
That said, the French transposition of ISO/IEC 10646 was not carried out as just a 
sort of joint venture of Canada and France (which btw has stepped out, leaving Québec 
alone to support the cost of future editions! Really ugly), given that it got feedback 
from numerous countries, part of which was written in French, and went through a heavy 
ballot process. Thus, getting it changed is not easy since it was approved at the time, 
and any change requests should be documented and are primarily harmful as they threaten 
stability. Name changes affecting rare characters prove to be feasible, while on the 
other hand, syncing the French name of U+202F with common practice and TUS is obviously 
more complicated, which in turn compromises usability in UIs, where we’re therefore 
likely to use descriptors, i.e. altered names, for roughly half of the characters 
bearing a specific name. Somehow the same rationale as for UTN #24, but somewhat less 
apposite given that the French transposition is not constrained by stability policies.
 
Best regards,
 
Marcel
 

RE: Translating the standard

2018-03-13 Thread Marcel Schneider via Unicode
On Mon, 12 Mar 2018 14:55:28 +, Michel Suignard wrote:
> 
> Time to correct some facts.
> The French version of ISO/IEC 10646 (2003 version) were done in a separate 
> effort by Canada and France NBs and not within SC2 proper. 
> National bodies are always welcome to try to transpose and translate an ISO 
> standard. But unless this is done by the ISO Sub-committee
> (SC2 here) itself, this is not a long-term solution. This was almost 15 years 
> ago. I should know, I have been project editor for 10646 since 
> October 2000 (I started as project editor in 1997 for part-2, and been 
> involved in both Unicode and SC2 since 1990).

Then it can be referred to as “French version of ISO/IEC 10646” but I’ve got 
Andrew’s point, too.

> 
> Now to some alternative facts:
> >Since ISO has made of standards a business, all prior versions are removed 
> >from the internet, 
> >so that they donʼt show up even in that list (which Iʼd used to grab a free 
> >copy, just to check
> > the differences). Because if they had public archives of the free 
> > standards, not having any 
> >for the pay standards would stand out even more.
> >This is why if you need an older version for reference, you need to find a 
> >good soul in
> > the organization, who will be so kind to make a copy for you in the 
> > archives at the
> > headquarters.
> 
> OK, yes, the old versions are removed from the ISO site. Andrew has probably 
> easier access to older versions than you through BSI.
> He has been involved directly in SC2 work for many years. The 2003 version is 
> completely irrelevant now anyway and again was not
> done by the SC, there was never a project editor for a French version of 
> 10646.

Call him whatever, but how can a project thrive without a head?

I think relevance is not the only criterion in evaluating a translation. The most 
important would probably be usefulness. Older versions are an appropriate means to get 
in touch with Unicode, as discussed when some old core specs were proposed on this list.

> 
> >The last published French version of ISO/IEC 10646 — to which you 
> >contributed — is still available on
> > Patrickʼs site:
> >
> >http://hapax.qc.ca/Tableaux-5.0.htm
> 
> The only live part of that page is the code chart and does not correspond to 
> the 1064:2003 itself (they are in fact Unicode 5.0 charts,
> however close to 10646:2003 and its first 2 amendments), I am not sure the 
> original 10646:2003 (F), and the 2 translated amendments
> (1 and 2) are available anywhere and are totally obsolete today anyway. Only 
> Canada and/or Afnor may still have archived versions.

Given that each time some benevolent people have their nameslist translation ready 
for print, they have to pay for the tool and the fonts — just plainly disgusting. 

No wonder that once you get such a localized Code Charts edition printed out in PDF, 
it has everlasting value!

> 
> >(Iʼd noticed that the contributorsʼ list has slightly shrinked without being 
> >able to find out why.)
> > The Code Charts have not been produced, however (because there is actually 
> > no
> > redactor‐in‐chief, as already stated, and also because of budget cuts the 
> > government is not in
> > a position to pay the non‐trivial amount of money asked for by Unicode for 
> > use of the fonts
> > and/or [just trying to be as precise as I can this time| the owner of the 
> > tooling needed).
> 
> A bunch of speculation here, never was a 'redactor-in-chief' for French 
> version, Unicode never asked for money because first of all
> it does not own the tool (it is licensed by the tool owner who btw does this 
> work as a giant goodwill gesture, based on the money received
> and the amount of work required to get this to work).

Shame! Unicode should manage to get the funding — no problem for Apple! (though a 
problem for Microsoft, which had to fire many employees) — so that the developer is 
fully paid and rewarded. Why has Unicode no unlimited license? Because of the 
stinginess of those corporate members that have plenty of money to waste. I’ll save 
that off‐topic rant, but without ceasing to insist that he must be paid, fully paid, 
paid back, and paid in the future, all the more as the Code Charts are now printed 
annually and grow bigger and bigger.
It’s really up to the Consortium to gather the full license fee from their corporate 
members for the English version and any other interested locale. Unicode’s claimed 
mission logically encompasses making freely available as many localized Code Charts, 
and whatever else, as benevolent people translate the sources for. 

Shouldn’t that have been clear from the beginning?

> In a previous message you also made some speculation about Apple role or 
> possibility that have no relationship with reality.
> 
> >Having said that, I still believe that all ISO standards should have a 
> >French version, shouldnʼt they? 
> 
> You are welcome to contribute to that. Good luck though.
> 
> On a side note, I have been working with the 

Re: Translating the standard

2018-03-12 Thread Marcel Schneider via Unicode
On Mon, 12 Mar 2018 10:00:16 +, Andrew West wrote:
> 
> On 12 March 2018 at 07:59, Marcel Schneider via Unicode
>  wrote:
> >
> > Likewise ISO/IEC 10646 is available in a French version
> 
> No it is not, and never has been.
> 
> Why don't you check your facts before making misleading statements to this 
> list?
> 
> > or at least, it should have an official French version like all ISO 
> > standards.
> 
> That is also blatantly untrue.
> 
> Only six of the publicly available ISO standards listed at
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> have French versions, and one has a Russian version. You will notice
> that there is no French version of ISO/IEC 10646.
> 
> Andrew

Since ISO has made a business of standards, all prior versions are removed from 
the internet, so that they donʼt show up even in that list (which Iʼd used to grab a 
free copy, just to check the differences). Because if they had public archives of the 
free standards, not having any for the pay standards would stand out even more.
This is why, if you need an older version for reference, you need to find a good soul 
in the organization who will be so kind as to make a copy for you in the archives at 
the headquarters.

The last published French version of ISO/IEC 10646 — to which you contributed — 
is still available on Patrickʼs site:

http://hapax.qc.ca/Tableaux-5.0.htm

Actually, the French version has no chief redactor, and for a time the French 
version of the NamesList was maintained only to the extent of adding the new 
names (for use in ISO 14651). For Unicode 10.0.0, the French translation has 
again been fully updated to Code Charts production level:

http://hapax.qc.ca/ListeNoms-10.0.0.txt

(Iʼd noticed that the contributorsʼ list has shrunk slightly without being able 
to find out why.) The Code Charts have not been produced, however (because there 
is actually no redactor‐in‐chief, as already stated, and also because of budget 
cuts the government is not in a position to pay the non‐trivial amount of money 
asked for by Unicode for use of the fonts and/or [just trying to be as precise 
as I can this time] the owner of the tooling needed).

Having said that, I still believe that all ISO standards should have a French 
version,
shouldnʼt they? :)

Best regards,

Marcel



Re: Translating the standard

2018-03-12 Thread Marcel Schneider via Unicode
On Mon, 12 Mar 2018 07:39:53 +, Alastair Houghton wrote:
> 
> On 11 Mar 2018, at 21:14, Marcel Schneider via Unicode  wrote:
> > 
> > Indeed, to be fair. And for implementers, reading documentation in English is 
> > scarcely ever much of a problem, no matter what the locale is.
> 
> Agreed. Implementers will already understand English; you can’t write 
> computer software
> without, since almost all documentation is in English, almost all computer 
> languages are
> based on English, and, to be frank, a large proportion of the software market 
> is itself
> English speaking. I have yet to meet a software developer who didn’t speak 
> English.
> 
> That’s not to say that people wouldn’t appreciate a translation of the 
> standard, but there are,
> as others have pointed out, obvious maintenance problems, not to mention the 
> issue that
> plagues some international institutions, namely the fact that translations 
> are necessarily
> non-canonical and so those who really care about the details of the rules 
> usually have to refer
> to a version in a particular language (sometimes that language might be 
> French rather than
> English; very occasionally there are two versions declared, for political 
> reasons, to both be
> canonical, which is obviously risky as there’s a chance they might differ 
> subtly on some point,
> perhaps even because of punctuation).

In the EU it sometimes happened that the French version was so sloppy that it 
turned the issue into an entirely different one; at the Unicode‐ISO/IEC merger, 
though, the bad will was clearly on the other side…

> 
> In terms of widespread understanding of the standard, which is where I think 
> translation is
> perhaps more important, I’m not sure translating the actual standard itself 
> is really the way
> forward. It’d be better to ensure that there are reliable translations of 
> books like
> Unicode Demystified or Unicode Explained - or, quite possibly, other books 
> aimed more at
> the general public rather than the software community per se.

Good point. What we need most of all is a complete terminology, as well as full 
sets of character names in every language, so that people can talk about the 
standard after reading it in English.
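
To make that concrete: below is a minimal, purely illustrative sketch (not any 
existing Unicode tooling) of how an English names list and a localized one, such 
as the ListeNoms file linked elsewhere in this thread, could be aligned by code 
point into a bilingual table of character names. It only assumes the basic 
NamesList entry pattern of a hexadecimal code point, a tab and a name; directive 
and annotation lines are skipped, and the file names in the usage comment are 
placeholders.

# Illustrative sketch: align two NamesList-format files by code point to get a
# bilingual table of character names. Assumes the basic "HHHH<TAB>NAME" entry
# pattern; directive lines (@...), tab-indented annotations and comments do not
# match the pattern and are therefore skipped.
import re

ENTRY = re.compile(r"^([0-9A-Fa-f]{4,6})\t(.+)$")

def read_names(path):
    """Return {code point: character name} for the plain entries of a names list."""
    names = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = ENTRY.match(line.rstrip("\n"))
            if m and not m.group(2).startswith("<"):  # skip <control>, <reserved>, ...
                names[int(m.group(1), 16)] = m.group(2)
    return names

def bilingual_table(path_a, path_b):
    """Pair up the names of the code points present in both lists."""
    a, b = read_names(path_a), read_names(path_b)
    return {cp: (a[cp], b[cp]) for cp in sorted(a.keys() & b.keys())}

# Hypothetical usage; the file names are placeholders:
# table = bilingual_table("NamesList.txt", "ListeNoms-10.0.0.txt")
# for cp, (name_en, name_fr) in list(table.items())[:10]:
#     print(f"U+{cp:04X}\t{name_en}\t{name_fr}")

Such a table is of course no substitute for an agreed terminology, but it shows 
how far the existing NamesList translations already go toward letting people 
name characters in their own language.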

Best regards,

Marcel



Re: Translating the standard

2018-03-12 Thread Marcel Schneider via Unicode
On Fri, 9 Mar 2018 08:41:35 -0800, Ken Whistler wrote:
> 
> 
> On 3/9/2018 6:58 AM, Marcel Schneider via Unicode wrote:
> > As for translating the Core spec as a whole, why did two recent attempts 
> > fail even before the maintenance stage, while the 3.1 project succeeded?
> 
> Essentially because both the Japanese and the Chinese attempts were 
> conceived of as commercial projects, which ultimately did not cost out 
> for the publishers, I think. Both projects attempted limiting the scope 
> of their translation to a subset of the core spec that would focus on 
> East Asian topics, but the core spec is complex enough that it does not 
> abridge well. And I think both projects ran into difficulties in trying 
> to figure out how to deal with fonts and figures.

This is normally catered for by Unicode, whose fonts are donated and 
licensed for the sole purpose of documenting the Standard. See the FAQ.

Templates of any material to be translated are sent by Unicode, arenʼt 
they? The Unicode home page reads: “An essential part of our mission 
is to educate and engage academic and scientific communities, and 
the general public.” Therefore, translators should just have to translate 
e.g. the NamesList following Kenʼs sample localization (TN #24) — 
which is already a hard piece of work — and send the file to Unicode, 
to get a localized version of the Code Charts. Likewise, ISO/IEC 10646 is 
available in a French version, or at least it should have an official 
French version like all ISO standards.
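
Purely as an illustration of what “just translating the NamesList” would involve 
mechanically, here is a small sketch of a sanity check a translator might run 
before sending a file in: it lists the code points whose entries exist in the 
English names list but are missing from the translated one. The entry pattern 
and the file names are assumptions made for the example, not part of TN #24 or 
of any actual Unicode submission process.

# Illustrative completeness check for a translated names list: report the code
# points that have a plain "HHHH<TAB>NAME" entry in the English file but none in
# the translation. All other line types are ignored.
import re

ENTRY = re.compile(r"^([0-9A-Fa-f]{4,6})\t(.+)$")

def entry_points(path):
    """Return the set of code points that have a plain name entry in the file."""
    with open(path, encoding="utf-8") as f:
        return {int(m.group(1), 16)
                for m in (ENTRY.match(line) for line in f) if m}

def missing_translations(english_path, translated_path):
    return sorted(entry_points(english_path) - entry_points(translated_path))

# Hypothetical usage; the file names are placeholders:
# for cp in missing_translations("NamesList.txt", "ListeNoms.txt"):
#     print(f"U+{cp:04X} still lacks a translated entry")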

If Unicode doesnʼt own the tooling yet, Apple would surely be happy to donate 
the funding to put Unicode in a position to fulfill its mission thoroughly, 
just as Apple (supposedly) donates non‐trivial amounts to many vendors to get 
them to remove old software from the internet.

Using such localized NamesLists with Unibook to browse the Code Charts locally 
is another question, since that would require handing the fonts out to the 
general public, so it is clearly a non‐starter. But browsing localized Code 
Charts in Adobe Reader would be a nice facility.

Best regards,

Marcel



Re: Translating the standard

2018-03-11 Thread Marcel Schneider via Unicode
On 11/03/18 21:05, Arthur Reutenauer wrote:
> 
> On Sun, Mar 11, 2018 at 07:35:11PM +0100, Marcel Schneider via Unicode wrote:
> > I fail to understand why increasing complexity decreases the need to be 
> > widely understood.
> 
> I’m pretty sure that everybody will agree that the need gets all the
> greater as Unicode and connected technologies get more complex. But you
> can hopefully see that the cost also increases, and that’s incentive
> enough to refrain from doing it – as it already was very costly fifteen
> years ago, it’s likely to be prohibitive today.
> 
> > Recurrent threads show how slowly Unicode education 
> > is spreading among English native speakers; others incidentally complained 
> > about Unicode‐educational issues in African countries. *Not* translating 
> > the Standard — in whatever way — wonʼt help steepen the curve.
> 
> Nobody is saying “let’s not translate the Unicode Standard”; what
> several people here have pointed out is that it pays to have more modest
> and manageable goals. Besides, you’re hinting yourself that the
> problems are not only with translation, since they also affect native
> English speakers.

Indeed, to be fair. And for implementers, reading documentation in English is 
scarcely ever much of a problem, no matter what the locale is.

Todayʼs policy is that we are welcome to browse Wikipedia:

http://www.unicode.org/standard/WhatIsUnicode.html

Fundamentally thatʼs true (although the wording could use some fixes regarding 
the difference between *using* Unicode and *documenting* Unicode), and itʼs 
consistent with current trends.

As for the cost, it still seems to me that weʼre far from the last word…

Best regards,

Marcel


