subject:"RE\: The Unicode Standard and ISO"

On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> 
> On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > Still a computer should be understandable off-line, so CLDR providing a 
> > standard library of error messages could be 
> > appreciated by the industry.
>
> The kind of translations that CLDR accumulates, like day, and month names, 
> language and territory names, are a widely
> applicable subset and one that is commonly required in machine generated or 
> machine-assembled text (like displaying
> the date, providing pick lists for configuration of locale settings, etc).
> The universe of possible error messages is a completely different beast.
> If you tried to standardize all error messages even in one language you would 
> never arrive at something that would be
> universally useful. While some simple applications may find that all their 
> needs for communicating with their users are
> covered, most would wish they had some other messages available.

Indeed, error messages althouth technical are like the world’s books, a 
never-ending production of content. To account for 
this infinity, I was not proposing a closed set of messages to replace 
application libraries able to display message #123.
In fact I wrote first: “If to date, automatic [automated] translation of 
technical English still does not work, then I’d suggest 
that CLDR feature a complete message library allowing to compose any localized 
piece of information.”
Here the piece of information displayed by the application is like a Lego 
spacecraft, the CLDR messages like Lego bricks.
I didn’t play with Lego since a very long time, but as a boy I learned how it 
works. I even remember that when building 
a construct, it often happened that some bricks were “missing”. A Lego box is 
complete wrt one or several models, but 
once my mom showing me the boxes on the shelves explained that they’re composed 
in a way that you’ll always lack 
something [when trying to build further]. — That doesn’t prevent Lego from 
thriving, nor many people from enjoying.

> To adopt your scheme, they would need to have a bifurcated approach, where 
> some messages follow the standard,
> while others do not (cannot). At that point, why bother? Determining whether 
> some message can be rewritten to follow
> the standard adds another level of complexity while you'd need to have 
> translation resources for all the non-standard ones anyway.

When CLDR libraries will allow to generate 98 % well-translated info boxes, 
human translators may focus on the remaining 
2 %. If for any reason they cannot, yet the vendor will get much less support 
requests than with the ill-translated messages.

> A middle ground is a shared terminology database that allows translators 
> working on different products to arrive at the same translation
> for the same things. Translators already know how to use such databases in 
> their work flow, and integrating a shared one with
> a product-specific one is much easier than trying to deal with a set of 
> random error messages.

If the scheme you outline works well, where come the reported oddities from? 
Obviously terminology is not all, it’s like Lego bricks without studs:
Terms alone don’t interlock and therefore the user cannot make sense. This is 
where CLDR’s hopefully on-coming localizable message bricks enter 
in action, helping automated translation software compose understandable 
output, using patterns. Google translate is unable to do that, as shown 
in the English and French translations of this sentence found in a page of the 
Finnish NB:
https://www.sfs.fi/ajankohtaista/uutiset/nappaimistoon_tarjolla_lisayksia.4249.news

Finnish: Kielitoimiston ohjeen mukaan esimerkiksi vieraskielisissä nimissä on 
pyrittävä säilyttämään kaikki tarkkeet.
Google English: According to the Language Office, for example, in the name of a 
foreign language, it is necessary to maintain all the checkpoints.
Google French: Selon le Language Office, par exemple, au nom d'une langue 
étrangère, il est nécessaire de maintenir tous les points de contrôle.

> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name. 
> 
> Especially if it is immediately tied to governmental procurement, forcing 
> people to adopt it (or live with it) whether it provides any actual benefit.

These statements make much sense to me…

> However, a high-quality terminology database recommends itself (and doesn't 
> need any procurement standards).
> Ultimately, it was its demonstrated usefulness that drove the adoption of 
> CLDR.

This is why I’m so hopeful that CLDR will go much farther than date and time 
and other locale settings, and emoji names and keywords.

Best regards,

Marcel

Re: The Unicode Standard and ISO

2018-06-09 Thread Asmus Freytag via Unicode

On 6/9/2018 12:01 PM, Marcel Schneider
via Unicode wrote:

Still a computer should be understandable off-line, so CLDR providing a standard library of error messages could be
appreciated by the industry.

The kind of translations that CLDR accumulates,
like day, and month names, language and territory names, are a
widely applicable subset and one that is commonly required in machine
generated or machine-assembled text (like displaying the date,
providing pick lists for configuration of locale settings, etc).
The universe of possible error messages is a
completely different beast.
If you tried to standardize all error
messages even in one language you would never arrive at
something that would be universally useful. While some simple
applications may find that all their needs for communicating
with their users are covered, most would wish they had some
other messages available.
To adopt your scheme, they would need to
have a bifurcated approach, where some messages follow the standard,
while others do not (cannot). At that point, why bother?
Determining whether some message can be rewritten to follow the
standard adds another level of complexity while you'd need to
have translation resources for all the non-standard ones anyway.
A middle ground is a shared terminology
database that allows translators working on different products
to arrive at the same translation for the same things.
Translators already know how to use such databases in their work
flow, and integrating a shared one with a product-specific one
is much easier than trying to deal with a set of random error
messages.
It's pushing this kind of impractical scheme
that gives standardizers a bad name.

Especially if it is immediately tied to governmental
procurement, forcing people to adopt it (or live with it) whether
it provides any actual benefit.
However, a high-quality terminology database
recommends itself (and doesn't need any procurement standards).
Ultimately, it was its demonstrated
usefulness that drove the adoption of CLDR.

A./

RE: The Unicode Standard and ISO

On the other hand, most end-users don’t appreciate to get “a screenfull of 
all-in-English” when “something happened.”
If even big companies still didn’t succeed in getting automatted computer 
translation to work for error messages, then 
best practice could eventually be to provide an internet link with every 
message. Given that web pages are generally 
less sibylline than error messages, they may be better translatable, and 
Philippe Verdy’s hint is therefore a working 
solution for localized software end-user support.

Still a computer should be understandable off-line, so CLDR providing a 
standard library of error messages could be 
appreciated by the industry.

Best regards,

Marcel 

On Sat, 9 Jun 2018 18:14:17 +, Jonathan Rosenne via Unicode wrote:
> 
> Translated error messages are a horror story. Often I have to play around 
> with my locale settings to avoid them.
> Using computer translation on programming error messages is no way near to 
> being useful.
> 
> Best Regards,
> 
> Jonathan Rosenne
> 
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe 
> Verdy via Unicode
> Sent: Saturday, June 09, 2018 7:49 PM
> To: Marcel Schneider
> Cc: UnicodeMailingList
> Subject: Re: The Unicode Standard and ISO
 

 

2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> > 
> > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> > 
> > > > Where there is opportunity for productive sync and merging with is
> > > > glibc. We have had some discussions, but more needs to be done-
> > > > especially a lot of tooling work. Currently many bug reports are
> > > > duplicated between glibc and cldr, a sort of manual
> > > > synchronization. Help wanted here.  
> > > 
> > > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > > help.
> > 
> > I wonder how much of that comes under the sad category of "better not
> > translated". If an English speaker has to resort to search engines to
> > understand, let alone fix, a reported problem, it may be better for a
> > non-English speaker to search for the error message in English, and then
> > with luck he may find a solution he can understand.
> 
> Then adding a "Display in English" button in the message box is best practice.
> Still I’ve never encountered any yet, and I guess this is because such a 
> facility 
> would be understood as an admission that up to now, i18n is partly a failure.

 


- Navigate any page on the web in another language than yours, with a Google 
Translate plugin enabled on your browser. you'll have the choice of seeing 
the automatic translation or the original.


 


- Many websites that have pages proposed in multiple languages offers such 
buttons to select the language you want to see (and not necesarily falling 
back to English, becausse the original may as well be in another language and 
English is an approximate translation, notably for sites in Asia, Africa and 
south America).


 


- Even the official websites of the European Union (or EEA) offers such choice 
(but at least the available translations are correctly reviewed for European 
languages; not all pages are translated in all official languages of member 
countries, but this is the case for most pages intended to be read by the 
general public, while pages about ongoing works, or technical reports for 
specialists, or recent legal decisions may not be translated except in a few 
"working languages", generally English, German, and French, sometimes Italian, 
the 4 languages spoken officially in multiple countries in the EEA 
including at least one in the European Union).


 


So it's not a "failure" but a feature to be able to select the language, and to 
know when a proposed translation is fully or partly automated.

RE: The Unicode Standard and ISO

2018-06-09 Thread Jonathan Rosenne via Unicode

Translated error messages are a horror story. Often I have to play around with 
my locale settings to avoid them. Using computer translation on programming 
error messages is no way near to being useful.

Best Regards,

Jonathan Rosenne

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Saturday, June 09, 2018 7:49 PM
To: Marcel Schneider
Cc: UnicodeMailingList
Subject: Re: The Unicode Standard and ISO



2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode 
mailto:unicode@unicode.org>>:
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
>
> On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> Marcel Schneider via Unicode 
> mailto:unicode@unicode.org>> wrote:
>
> > > Where there is opportunity for productive sync and merging with is
> > > glibc. We have had some discussions, but more needs to be done-
> > > especially a lot of tooling work. Currently many bug reports are
> > > duplicated between glibc and cldr, a sort of manual
> > > synchronization. Help wanted here.
> >
> > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > help.
>
> I wonder how much of that comes under the sad category of "better not
> translated". If an English speaker has to resort to search engines to
> understand, let alone fix, a reported problem, it may be better for a
> non-English speaker to search for the error message in English, and then
> with luck he may find a solution he can understand.

Then adding a "Display in English" button in the message box is best practice.
Still I’ve never encountered any yet, and I guess this is because such a 
facility
would be understood as an admission that up to now, i18n is partly a failure.

- Navigate any page on the web in another language than yours, with a Google 
Translate plugin enabled on your browser. you'll have the choice of seeing the 
automatic translation or the original.

- Many websites that have pages proposed in multiple languages offers such 
buttons to select the language you want to see (and not necesarily falling back 
to English, becausse the original may as well be in another language and 
English is an approximate translation, notably for sites in Asia, Africa and 
south America).

- Even the official websites of the European Union (or EEA) offers such choice 
(but at least the available translations are correctly reviewed for European 
languages; not all pages are translated in all official languages of member 
countries, but this is the case for most pages intended to be read by the 
general public, while pages about ongoing works, or technical reports for 
specialists, or recent legal decisions may not be translated except in a few 
"working languages", generally English, German, and French, sometimes Italian, 
the 4 languages spoken officially in multiple countries in the EEA including at 
least one in the European Union).

So it's not a "failure" but a feature to be able to select the language, and to 
know when a proposed translation is fully or partly automated.

Re: The Unicode Standard and ISO

2018-06-09 Thread Philippe Verdy via Unicode

2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :

> On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> >
> > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> >
> > > > Where there is opportunity for productive sync and merging with is
> > > > glibc. We have had some discussions, but more needs to be done-
> > > > especially a lot of tooling work. Currently many bug reports are
> > > > duplicated between glibc and cldr, a sort of manual
> > > > synchronization. Help wanted here.
> > >
> > > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > > help.
> >
> > I wonder how much of that comes under the sad category of "better not
> > translated". If an English speaker has to resort to search engines to
> > understand, let alone fix, a reported problem, it may be better for a
> > non-English speaker to search for the error message in English, and then
> > with luck he may find a solution he can understand.
>
> Then adding a "Display in English" button in the message box is best
> practice.
> Still I’ve never encountered any yet, and I guess this is because such a
> facility
> would be understood as an admission that up to now, i18n is partly a
> failure.


- Navigate any page on the web in another language than yours, with a
Google Translate plugin enabled on your browser. you'll have the choice of
seeing the automatic translation or the original.

- Many websites that have pages proposed in multiple languages offers such
buttons to select the language you want to see (and not necesarily falling
back to English, becausse the original may as well be in another language
and English is an approximate translation, notably for sites in Asia,
Africa and south America).

- Even the official websites of the European Union (or EEA) offers such
choice (but at least the available translations are correctly reviewed for
European languages; not all pages are translated in all official languages
of member countries, but this is the case for most pages intended to be
read by the general public, while pages about ongoing works, or technical
reports for specialists, or recent legal decisions may not be translated
except in a few "working languages", generally English, German, and French,
sometimes Italian, the 4 languages spoken officially in multiple countries
in the EEA including at least one in the European Union).

So it's not a "failure" but a feature to be able to select the language,
and to know when a proposed translation is fully or partly automated.

Re: The Unicode Standard and ISO

On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> 
> On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
> 
> > > Where there is opportunity for productive sync and merging with is
> > > glibc. We have had some discussions, but more needs to be done-
> > > especially a lot of tooling work. Currently many bug reports are
> > > duplicated between glibc and cldr, a sort of manual
> > > synchronization. Help wanted here.  
> > 
> > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > help.
> 
> I wonder how much of that comes under the sad category of "better not
> translated". If an English speaker has to resort to search engines to
> understand, let alone fix, a reported problem, it may be better for a
> non-English speaker to search for the error message in English, and then
> with luck he may find a solution he can understand.

Then adding a "Display in English" button in the message box is best practice.
Still I’ve never encountered any yet, and I guess this is because such a 
facility 
would be understood as an admission that up to now, i18n is partly a failure.

> In a related vein,
> one hears reports of people using English as the interface language,
> because they can't understand the messages allegedly in their native
> language.

If to date, automatic translation of technical English still does not work, 
then I’d suggest 
that CLDR feature a complete message library allowing to compose any localized 
piece 
of information. But such an attempt requires that all available human resources 
really 
focus on the project, instead of being diverted by interpersonal discordances. 
Sulking 
people around a project are an indicator of poor project management branding 
dissenters 
as enemies out of an inability to behave in a diplomatic way by lack of social 
skills.
At least that’s what they’d teach you in any management school.

The way Unicode behaves against William Overington is in my opinion a striking 
example 
of mismanagement. In one dimension I can see, the "localizable sentences" that 
William invented and that he actively promotes do fit exactly into the scheme 
of localizable 
information elements suggested in the preceding paragraph. I strongly recommend 
that 
instead of publicly blacklisting the author in the mailbox of the president and 
directing 
the List moderation to prohibit the topic as out of scope of Unicode, an 
extensible and flexible 
framework be designed in urgency under the Unicode‐CLDR umbrella to put an end 
to the 
pseudo‐localization that Richard pointed above.

OK I’m lacking diplomatic skills too, and this e‐mail is harsh, but I see it as 
a true echo.
And I apologize for my last reply to William Overington, if I need to.
http://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0118.html

Beside that, I’d suggest also to add a CLDR library of character name elements 
allowing 
to compose every existing Unicode character name in all supported locales, for 
use in 
system character pickers and special character dialogs. This library should 
then be updated 
at each major release of the UCS. Hopefully this library is then flexible 
enough to avoid 
any Standardese, be it in English, in French, or in any language aping English 
Standardese.
E.g. when the ISO/IEC 10646 mirror of Unicode was published in an official 
French version, 
the official translators felt partly committed to ape English Standardese, of 
which we know 
that it isn’t due mainly to Unicode, but to the then‐head of ISO/IEC JTC1 SC2 
WG2. Not to 
warm up that old grudge, just to show how on‐topic that is. Be it Standardese 
or pseudo‐
localization, the effect is always to worsen UX by missing the point.

Best regards,

Marcel

Re: The Unicode Standard and ISO

2018-06-09 Thread Philippe Verdy via Unicode

I just see the WG2 as a subcomity where governements may just check their
practices and make minimum recommendations. Most governements are in fact
very late to adopt the industry standards that evolve fast, and they just
want to reduce the frequency of necessary changes jsut to enterinate what
seems to be stable enough and gives them long enough period to plan the
transitions. So ISO 10646 has had in fact very few updates compared to
Unicode (even if these Unicode changes were "synchronized", most of them
remained for long within optional amendments that are then synchronized in
ISO 10646 long after the inbdustry has started working on updating their
code for Unicode and made checks to ensure that it is stable enough to be
finally included in ISO 10646 later as the new minimal platform that
governments can reasonnably ask to be provided by their providers in the
industry at reasonnable (or no) additional cost.
So I see now ISO 646 only as a small subset of the Unicode standard. The
WG2 technical comity is jsut there to finally approve what can be endorsed
as a standard whose usage is made mandatory in governments, when the UTS
itself is still (and will remain) just optional (not a requirement). It
takes months or years to have new TUS features being available on all
platforms that governements use. WG2 probably does not focus really on
technical merits, but just evaluating the implementation and deployment
costs, and that's where the WG2 members decide what is reasonnable for them
to adopt (let's also not forget that ISO standards are mapped to national
standards that reference it normatively, and these national standards (or
European standards in the EEA) are legal requirements: governements then no
longer need to specify each time which requirement they want, they're just
saying that the national standards within a certain class are required for
all product/service offers, and failure to implement theses standards will
require those providers to fix their products at no additional cost, and
independantly of the contractual or subscribed period of support).

2018-06-08 23:28 GMT+02:00 Marcel Schneider via Unicode :

> On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> >
> […]
> > There's no value added in creating "mirrors" of something that is
> successfully being developed and maintained under a different umbrella.
>
> Wouldn’t the same be true for ISO/IEC 10646? It has no value added
> neither, and WG2 meetings could be merged with UTC meetings.
> Unicode maintains the entire chain, from the roadmap to the production
> tool (that the Consortium ordered without paying a full license).
>
> But the case is about part of the people who are eager to maintain an
> alternate forum, whereas the industry (i.e. the main users of the data)
> are interested in fast‐tracking character batches, and thus tend to
> shortcut the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying
> the same logic than to ISO/IEC 15897, WG2 would be eliminated. The reason
> why it was not, is that Unicode was weaker and needed support
> from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being
> useless in practice, as it pursued an unrealistic encoding scheme.
> To overcome this, somebody in ISO started actively campaigning for the
> Unicode encoding model, encountering fierce resistance from fellow
> ISO people until he succeeded in teaching them real‐life computing. He had
> already invented and standardized the sorting method later used
> to create UCA and ISO/IEC 14651. I don’t believe that today everybody
> forgot about him.
>
> Marcel
>
>

Re: The Unicode Standard and ISO

2018-06-09 Thread Richard Wordingham via Unicode

On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
Marcel Schneider via Unicode  wrote:

> > Where there is opportunity for productive sync and merging with is
> > glibc. We have had some discussions, but more needs to be done-
> > especially a lot of tooling work. Currently many bug reports are
> > duplicated between glibc and cldr, a sort of manual
> > synchronization. Help wanted here.   
> 
> Noted. For my part, sadly for C libraries I’m unlikely to be of any
> help.

I wonder how much of that comes under the sad category of "better not
translated".  If an English speaker has to resort to search engines to
understand, let alone fix, a reported problem, it may be better for a
non-English speaker to search for the error message in English, and then
with luck he may find a solution he can understand.  In a related vein,
one hears reports of people using English as the interface language,
because they can't understand the messages allegedly in their native
language.

Richard.

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 09:20:09 -0700, Steven R. Loomis via Unicode wrote:
[…]
> But, it sounds like the CLDR process was successful in this case. Thank you 
>for contributing.

You are welcome, but thanks are due to the actual corporate contributors.

[…]
> Actually, I think the particular data item you found is relatively new. The 
> first values entered
> for it in any language were May 18th of this year.  Were there votes for 
> "keycap" earlier?

The "keycap" category is found as soon as in v30 (released 2016-10-05).

> Rather than a tracer finding evidence of neglect, you are at the forefront of 
> progressing the translated data for French. Congratulations!

The neglect is on my part as I neglected to check the data history. 
Please note that I did not make accusations of neglect. Again: The historic 
Code Charts translators, partly still active, sulk CLDR 
because Unicode is perceived as sulking ISO/IEC 15897, so that minimal staff is 
actively translating CLDR for the French locale and can 
legitimately feel forsaken. I even made detailed suppositions as of how it 
could happen that "keycap" remained untranslated.

[…] [Unanswered questions (please refer to my other e‐mails in this thread)]

> The registry for ISO/IEC 15897 has neither data for French, nor structure 
> that would translate the term "Characters | Category | Label | keycap". 
> So there would be nothing to merge with there.

Correct. The only data for French is an ISO/IEC 646 charset:
http://std.dkuug.dk/cultreg/registrations/number/156
As far as I can see there are available data to merge for Danish, Faroese, 
Finnish Greenlandic, Norwegian, and Swedish.

> So, historically, CLDR began not a part of Unicode, but as part of Li18nx 
> under the Free Standards Group. See the bottom of the page 
> http://cldr.unicode.org/index/acknowledgments
> "The founding members of the workgroup were IBM, Sun and OpenOffice.org". 
> What we were trying to do was to provide internationalized content for Linux, 
> and also, to resolve the then-disparity between locale data
> across platforms. Locale data was very divergent between platforms - spelling 
> and word choice changes, etc.  Comparisons were done
> and a Common locale data repository  (with its attendant XML formats) 
> emerged. That's the C in CLDR. Seed data came from IBM’s ICIR
> which dates many decades before 15897 (example 
> http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/
> - 4th edition published in 1994.) 100 locales we contributed to glibc as well.

Thank you for the account and resources. The Linux Internationalization 
Initiative appears to have issued a last release on August 23, 2000:
https://www.redhat.com/en/about/press-releases/83
the year before ISO/IEC 15897 was lastly updated:
http://std.dkuug.dk/cultreg/registrations/chreg.htm

> Where there is opportunity for productive sync and merging with is glibc. We 
> have had some discussions, but more needs to be
> done- especially a lot of tooling work. Currently many bug reports are 
> duplicated between glibc and cldr, a sort of manual synchronization.
> Help wanted here. 

Noted. For my part, sadly for C libraries I’m unlikely to be of any help.

Marcel

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 20:45:26 +0200
Philippe Verdy via Unicode  wrote:

> 2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  

> The way tailoring is designed in CLDR using only data used by a
> generic algorithm, and not custom algorithm is not the only way to
> collate Lao. You can perectly add new custom algorithm promitives
> that will use new collation data rules that can be inserted as
> "hooks" in UCA (which provides several points at which it is
> possible, but UCA just makes these hooks act as "no-op".

The ideal is to have a common library rather than add specific routines
to support specific languages.  Now, this can be done in a common
library; ICU break iterators have dedicated routines for CJK and for
Siamese.  I wonder if this could be done for Lao and possibly Tai
Lue.  I've a vague recollection that UCA collation for Tai Lue in the
New Tai Lue script only needs thousands of contractions, so it may work
well enough in the main CLDR collation algorithm.  Martin Hosken
provided the numbers, probably on the Unicore list, when New Tai Lue
formally switched from phonetic to visual order.  Taking the definition
of logical order literally, the change legitimised the logical order of
New Tai Lue. 

> You can be much faster is you create a specific library for Lao, that
> would still be able to process the basic collation rules and then
> make more advanced inferences based on larger cluster boundaries than
> just those considered in the standard basic UCA, so it is perfectly
> possible to extend it to cover more complex Lao syllables and various
> specific quirks (such as hyphenation in the middle of clusters, as
> seen in some Indic scripts using left matras).

How is this hyphenation done?  The answer probably belongs in the
thread entitled 'Hyphenation Markup', unless its restricted to the
visual order scripts.  If it's occurring in the visual order scripts,
we may need to add contractions for ; U+00AD breaks contractions, and, indeed, may be used for
exactly that purpose, as it is generally easier to type than CGJ.
While I've seen line-breaking after a left matra in Thai, I've never
*seen* a hyphen after a left matra.

Richard.

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 14:14:51 -0700
"Steven R. Loomis via Unicode"  wrote:

> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR. Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation.  

>  CLDR is not a collation implementation, it is a data repository with
> associated specification. It was never required to 'support' DUCET.
> The contents of CLDR have no bearing on whether implementations
> support DUCET.

DUCET used to be the root collation of CLDR.

> CLDR ≠ ICU.

DUCET is a standard collation.  Language-specific collations are
stored in CLDR, so why not an international standard?  Does ICU store
collations not defined in CLDR?  The formal snag is that the collations
have to be LDML tailorings of the CLDR root collation, which is a
formal problem for U+FDD0.  I would expect you to argue that it is more
useful for U+FDD0 to have the special behaviour defined in CLDR, and
restrict conformance with DUCET to characters other than non-characters.

> On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  

> > On Fri, 8 Jun 2018 13:40:21 +0200
> > Mark Davis ☕️  wrote:

> > > > The UCA contains features essential for respecting canonical
> > > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > > apparently even going to the extreme of implicitly declaring
> > > > that Vietnamese is not a human language.  

> > > A bit over the top, eh?  
> >
> > Then remove the "no known language" from the bug list

> What does this refer to?

http://userguide.icu-project.org/collation/customization

Under the heading "Known Limitations" it says:

"The following are known limitations of the ICU collation
implementation. These are theoretical limitations, however, since there
are no known languages for which these limitations are an issue.
However, for completeness they should be fixed in a future version
after 1.8.1. The examples given are designed for simplicity in testing,
and do not match any real languages."

Then, the particular problem is listed under the heading "Contractions
Spanning Normalization".  The assumption is that FCD strings do not
need to be decomposed.  This comes unstuck when what is locally a
secondary weight due to a diacritic on a vowel has to be promoted to a
primary weight to support syllable by syllable collation in a system
not set up for such a tiered comparison.

> > …ICU isn't
> > fast enough to load a collation from customisation - it takes
> > hours!  

> > ICU is, alas, ridiculously slow

> I'm also curious what this refers to, perhaps it should be a separate
> ICU bug?

There may be reproducibility issues.  A proper bug report will take some
work.  There's also the argument that nearly 200,000 contractions is
excessive.  I had to disable certain checks that were treating "should
not" as a prohibition - working round them either exceeded ICU's
capacity because of the necessary increase in the number of
contractions, or was incompatible with the design of the collation.

The weight customisation creates 45 new weights, with lines like

"&\u0EA1 = \ufdd2\u0e96 < \ufdd2\u0e97 # MO for THO_H & THO_L"

I use strings like \ufdd2\u0e96 to emulate ISO/IEC 14651
(primary) weights.  I carefully reuse default Lao weights so as to keep
collating elements' list of collation elements short.

There are a total of 187174 non-comment lines, most being simple
contractions like

"&\u0ec8\ufdd2\u0e96\ufdd2AAW\ufdd3\u0e94 = \u0ec8\u0e96\u0ead\u0e94 #
1+K+AW+N  N is mandatory!"

and prefix contractions like

"&\ufdd2AAW\ufdd3\u0e81\u0ec9 = \u0e96\u0ec9 | ອ\u0e81 # K+1|ອ+N
 N is mandatory".

I strip the comments off as I convert the collation definition to
UTF-16; if I remember correctly I also have to convert escape sequences
to characters.  That processing is a negligible part of the time.

By comparison, the loading of 30,000 lines from allkeys.txt is barely
discernible.

The generation of the loading of the collation was reasonably fast when
I generated DUCET-style collation weights using bash.

For my purposes, I would get better performance if ICU's collation just
blindly converted strings to NFD, but then all I am using it for is to
compare collation rules against a dictionary.  I suspect it's just that
I lose out massively as a result of ICU's tradeoffs.

Richard.

Re: The Unicode Standard and ISO

2018-06-08 Thread Asmus Freytag via Unicode

On 6/8/2018 2:28 PM, Marcel Schneider
via Unicode wrote:

On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:

[…]

There's no value added in creating "mirrors" of something that is successfully being developed and maintained under a different umbrella.

Wouldn’t the same be true for ISO/IEC 10646? It has no value added neither, and WG2 meetings could be merged with UTC meetings.
Unicode maintains the entire chain, from the roadmap to the production tool (that the Consortium ordered without paying a full license).

Without going into a lot of historical detail, the situations are
not comparable; I don't think I agree to the way you summarize
things here, but unfortunately I have not the time to elaborate
further. It suffices to note that 10646 was and is a special case.

Not every attempt at standardization has to happen at ISO. Even on a
treaty level there have always been other organizations, for example
ITU.

Almost the worst thing you can do is duplicating an existing and
well-established effort (by which I mean not a paper effort, but one
that is being implemented widely). Doing so just adds needless
complexity, but it will always satisfy people who are engaging in
the kind of turf-war that makes them feel important.

A./

But the case is about part of the people who are eager to maintain an alternate forum, whereas the industry (i.e. the main users of the data)
are interested in fast‐tracking character batches, and thus tend to shortcut the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying
the same logic than to ISO/IEC 15897, WG2 would be eliminated. The reason why it was not, is that Unicode was weaker and needed support
from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being useless in practice, as it pursued an unrealistic encoding scheme.
To overcome this, somebody in ISO started actively campaigning for the Unicode encoding model, encountering fierce resistance from fellow
ISO people until he succeeded in teaching them real‐life computing. He had already invented and standardized the sorting method later used
to create UCA and ISO/IEC 14651. I don’t believe that today everybody forgot about him.

Marcel

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 16:54:20 -0400, Tom Gewecke via Unicode wrote:
> 
> > On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode  wrote:
> > 
> > People relevant to projects for French locale do trace the borderline of 
> > applicability wider 
> > than do those people who are closerly tied to Unicode‐related projects.
> 
> Could you give a concrete example or two of what these people mean by “wider 
> borderline of applicability”
> that might generate their ethical dilemma?
> 

Drawing the borderline until which ISO/IEC should be among the involved 
parties, as I put it, is about the Unicode policy 
as of how ISO/IEC JTC1 SC2 WG2 is involved in the process, how it appears in 
public (FAQs, Mailing List responding practice, 
and so on), and how people in that WG2 feel with respect to Unicode. That may 
be different depending on the standard concerned 
(ISO/IEC 10646, ISO/IEC 14651), so that the former is put in the first place as 
vital to Unicode, while the latter is almost entirely 
hidden (except in appendix B of UTS #10).

Then when it’s up to locale data, Unicode people see the borderline below, 
while ISO people tend to see it above. This is why 
Unicode people do not want the twin‐standards‐bodies‐principle applied to 
locale data, and are ignoring or declining any attempt 
to equalize situations, arguing that ISO/IEC 15897 is useless. As I’ve pointed 
in my previous e‐mail responding to Asmus Freytag, 
ISO/IEC 10646 was about as useless until Unicode came on it and merged itself 
with that UCS embryo (not to say that miscarriage 
on the way). The only thing WG2 could insist upon were names and huge bunches 
of precomposed or preformatted characters that 
Unicode was designed to support in plain text by other means. The essential 
part was Unicode’s, and without Unicode we wouldn’t 
have any usable UCS. ISO/IEC 15897 appears to be in a similar position: not 
very useful, not very performative, not very complete. 
But an ISO/IEC standard. Logically, Unicode should feel committed to merge with 
it the same way it did with the other standard, 
maintaining the data, and publishing periodical abstracts under ISO coverage. 
There is no problem in publishing a framework standard 
under the ISO/IEC umbrella, associated with a regular up‐to‐date snapshot of 
the data.

That is what I mean when I say that Unicode arbitrarily draw borderlines of 
their own, regardless of how people at ISO feel about them.

Marcel

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> 
[…]
> There's no value added in creating "mirrors" of something that is 
> successfully being developed and maintained under a different umbrella.

Wouldn’t the same be true for ISO/IEC 10646? It has no value added neither, and 
WG2 meetings could be merged with UTC meetings.
Unicode maintains the entire chain, from the roadmap to the production tool 
(that the Consortium ordered without paying a full license).

But the case is about part of the people who are eager to maintain an alternate 
forum, whereas the industry (i.e. the main users of the data) 
are interested in fast‐tracking character batches, and thus tend to shortcut 
the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying 
the same logic than to ISO/IEC 15897, WG2 would be eliminated. The reason why 
it was not, is that Unicode was weaker and needed support 
from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being 
useless in practice, as it pursued an unrealistic encoding scheme.
To overcome this, somebody in ISO started actively campaigning for the Unicode 
encoding model, encountering fierce resistance from fellow 
ISO people until he succeeded in teaching them real‐life computing. He had 
already invented and standardized the sorting method later used 
to create UCA and ISO/IEC 14651. I don’t believe that today everybody forgot 
about him.

Marcel

Re: The Unicode Standard and ISO

2018-06-08 Thread Steven R. Loomis via Unicode

Richard,

> But the consortium has formally dropped the commitment to DUCET in CLDR.
> Even when restricted to strings of assigned characters, the
> CLDR and ICU no longer make the effort to support the DUCET
> collation.

 CLDR is not a collation implementation, it is a data repository with
associated specification. It was never required to 'support' DUCET. The
contents of CLDR have no bearing on whether implementations support DUCET.

CLDR ≠ ICU.

On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️  wrote:
>
> > > The UCA contains features essential for respecting canonical
> > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list
>

What does this refer to?

>
> …ICU isn't
> fast enough to load a collation from customisation - it takes hours!

…

> ICU is, alas, ridiculously slow
>

I'm also curious what this refers to, perhaps it should be a separate ICU
bug?

Re: The Unicode Standard and ISO

2018-06-08 Thread Tom Gewecke via Unicode

> On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode 
>  wrote:
> 
>  People relevant to projects for French locale do trace the borderline of 
> applicability wider 
> than do those people who are closerly tied to Unicode‐related projects.

Could you give a concrete example or two of what these people mean by “wider 
borderline of applicability” that might generate their ethical dilemma?

Re: The Unicode Standard and ISO

2018-06-08 Thread Asmus Freytag via Unicode


  
  
On 6/8/2018 5:01 AM, Michael Everson
  via Unicode wrote:


  
and achieving a fullscale merger with ISO/IEC 15897, after which the valid data stay hosted entirely in CLDR, and ISO/IEC 15897 would be its ISO mirror. 

  
  I wonder if Mark Davis will be quick to agree with me  when I say that ISO/IEC 15897 has no use and should be withdrawn

I don't know about Mark, but that would have
been my position. 
  
There's no value added in creating "mirrors"
of something that is successfully being developed and maintained
under a different umbrella.
  
A./

Re: The Unicode Standard and ISO

2018-06-08 Thread Philippe Verdy via Unicode

2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️  wrote:
>
> > Mark
> >
> > On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> > unicode@unicode.org> wrote:
> >
> > > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > > Marcel Schneider via Unicode  wrote:
> > >
> > > > Thank you for confirming. All witnesses concur to invalidate the
> > > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > > After being invented in its actual form, sorting was standardized
> > > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > > Algorithm, the latter including practice‐oriented extra
> > > > features.
> > >
> > > The UCA contains features essential for respecting canonical
> > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list, or declare that
> you don't know SE Asian languages.
>
> The root problem is that the UCA cannot handle syllable by syllable
> comparisons; if the UCA could handle that, the correct collation of
> unambiguous true Lao would become simple.  The CLDR algorithm provides
> just enough memory to make Lao collation possible; however, ICU isn't
> fast enough to load a collation from customisation - it takes hours!
> One could probably do better if one added suffix contractions, but
> adding that capability might be nightmare.

The way tailoring is designed in CLDR using only data used by a generic
algorithm, and not custom algorithm is not the only way to collate Lao. You
can perectly add new custom algorithm promitives that will use new
collation data rules that can be inserted as "hooks" in UCA (which provides
several points at which it is possible, but UCA just makes these hooks act
as "no-op".

You can be much faster is you create a specific library for Lao, that would
still be able to process the basic collation rules and then make more
advanced inferences based on larger cluster boundaries than just those
considered in the standard basic UCA, so it is perfectly possible to extend
it to cover more complex Lao syllables and various specific quirks (such as
hyphenation in the middle of clusters, as seen in some Indic scripts using
left matras).

Not everything has to be specified by UCA itself notably if it's specific
to a script (or sometimes only a single locale, i.e. a specific combination
of a script, language, orthographic convention, and stylistic convention
for some kinds of documents or presentations).

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 13:40:21 +0200
Mark Davis ☕️  wrote:

> Mark
> 
> On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> >  
> > > Thank you for confirming. All witnesses concur to invalidate the
> > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > After being invented in its actual form, sorting was standardized
> > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > Algorithm, the latter including practice‐oriented extra
> > > features.  
> >
> > The UCA contains features essential for respecting canonical
> > equivalence.  ICU works hard to avoid the extra effort involved,
> > apparently even going to the extreme of implicitly declaring that
> > Vietnamese is not a human language.  

> A bit over the top, eh?

Then remove the "no known language" from the bug list, or declare that
you don't know SE Asian languages.

The root problem is that the UCA cannot handle syllable by syllable
comparisons; if the UCA could handle that, the correct collation of
unambiguous true Lao would become simple.  The CLDR algorithm provides
just enough memory to make Lao collation possible; however, ICU isn't
fast enough to load a collation from customisation - it takes hours!
One could probably do better if one added suffix contractions, but
adding that capability might be nightmare.

> I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868,
> which nicely outlines a proposal for dealing with a number of
> problems with Vietnamese.

It still includes a brute force work-around.

> We clearly don't support every sorting feature that various
> dictionaries and agencies come up with. Sometimes it is because we
> can't (yet) see a good way to do it:

>1. it might be not determinant: many governmental standards or
> style sheets require "interesting" sorting, such as determining that
> "XI" is a roman numeral (not the president of China) and sorting as
> 11, or when "St." is meant to be Street *and* when meant to be Saint
> (St. Stephen's St.)

I believe the first is a character identity issue.  Some of us
see the difference between U+0058 LATIN CAPITAL LETTER X and the
discouraged U+2169 ROMAN NUMERAL TEN as more than just a round-tripping
difference.  For example, by hand, I write the 'V' in 'Henry V' with a
regnal number quite differently to 'Henry V.' where 'V' is short for a
name.

> > > Since then,
> > > these two standards are kept in synchrony uninterruptedly.  

> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR.  Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation. Indeed, I'm not even sure that the DUCET is a tailoring
> > of the root CLDR collation, even when restricted to assigned
> > characters.  Tailorings tend to have odd side effects; fortunately,
> > they rarely if ever matter. CLDR root is a rewrite with
> > modifications of DUCET; it has changes that are prohibited as
> > 'tailorings'! 

> CLDR does make some tailorings to the DUCET to create its root
> collation, notably adding special contractions of private use
> characters to allow for tailoring support and indexes [
> http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
> ]  plus the rearrangement of some characters (mostly punctuation and
> symbols) to allow runtime parametric reordering of groups of
> characters (eg to put numbers after letters) [
> http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
> ].

My main point is that for practical purposes (i.e. ICU), Unicode has
moved away from ISO/IEC 14651.  The difference is small.  I didn't say
that there weren't good reasons.

>- If there are other changes that are not well documented, or if
> you think those features are causing problems in some way, please
> file a ticket.

Well, I don't have to use DUCET, though I've found it easier for
unmaintainable tailorings.  I need to write code to apply
non-parametric LDML tailorings - ICU is, alas, ridiculously slow.  I
hope that's just a matter of optimisation balance between compiling a
tailoring and applying it.  Are there any published compliance tests
for non-parametric tailorings?  I'm not sure how one would check that an
alleged parametric reordering of numbers and letters applied to a
tailoring of DUCET was in accordance with the LDML definition, but I
don't think you want to expend money sorting that out. 

>- If there is a particular change that you think is not conformant
> to UCA, please also file that.

Sorry, I must have scanned the conformance requirements too quickly.  I
had got it into my head that someone had recklessly required that
tailorings being in accordance with LDML.  That constraint only applies
to parametric tailorings, so any properly structured unambiguously

Re: The Unicode Standard and ISO

2018-06-08 Thread Steven R. Loomis via Unicode

Marcel,

On Fri, Jun 8, 2018 at 6:52 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:
>
> What got me started is that before even I requested a submitter ID (and
> the reason why I’ve requested one),
> "Characters | Category | Label | keycap" remained untranslated, i.e. its
> French translation was "keycap".
> When I proposed "cabochon", the present contributors kindly upvoted or
> proposed "touche" even before I
> launched a forum thread, and when I got aware, I changed my vote and
> posted the rationale on the forum,
> so the upvoting contributor kindly followed so that now we stay united for
> "touche", rather than "keycap".
>


 But, it sounds like the CLDR process was successful in this case. Thank
you for contributing.


> Please note that I acknowledge everybody and don’t criticize anybody. It
> doesn’t require much imagination
> to figure out that when CLDR was set up, there were so few or even no
> French contributors that translating
> "keycap" either fell out of deadline or was overlooked or whatever, and
> later passed unnoticed. That is a
> tracer detecting that none of the people setting up the French translation
> of the Code Charts were ever on
> the CLDR project. Because if anybody of them had been active on CLDR, no
> English word would have been
> kept in use mistakenly for the French locale.
>

Actually, I think the particular data item you found is relatively new. The
first values entered for it in any language were May 18th of this year.
Were there votes for "keycap" earlier?
Rather than a tracer finding evidence of neglect, you are at the forefront
of progressing the translated data for French. Congratulations!

> French contributors are not "prevented from cooperating". Where do you
get this from? Who do you mean?

>
> Historic French contributors are ethically prevented from contributing to
> CLDR, because of a strong commitment to involve ISO/IEC,
> a notion that is very meaningful to Unicode. People relevant to projects
> for French locale do trace the borderline of applicability wider
> than do those people who are closerly tied to Unicode‐related projects.


Which contributors specifically are prevented?


> > There were not "many attempts" at a merger, and Unicode didn't "refuse"
> anything. Who do you think "attempted", and when?
>
> An influential person consistently campaigned for a merger of CLDR and
> ISO/IEC 15897, but that never succeeded. It’s unlikely to be ignored.


Which person?

> Albeit given the state of ISO/IEC 15897, there was nothing such a merger
> would have contributed anyway.
>
> I’ve took a glance at the data of ISO/IEC 15897 and cannot figure out that
> there is nothing to pick from. At least they won’t be disposed to
> sell you "keycap" as a French term or as being in any use in that target
> locale. And anyhow, the gesture would be appreciated as a piece
> of good diplomacy. Hopefully a lightweight proceeding could end up in that
> data being transferred to CLDR, and this being cited as sole
> normative reference in ISO/IEC 15897. As a result, everybody’s happy.
>

 The registry for ISO/IEC 15897 has neither data for French, nor structure
that would translate the term "Characters | Category | Label | keycap". So
there would be nothing to merge with there.

So, historically, CLDR began not a part of Unicode, but as part of Li18nx
under the Free Standards Group. See the bottom of the page
http://cldr.unicode.org/index/acknowledgments "The founding members of the
workgroup were IBM, Sun and OpenOffice.org".  What we were trying to do was
to provide internationalized content for Linux, and also, to resolve the
then-disparity between locale data across platforms. Locale data was very
divergent between platforms - spelling and word choice changes, etc.
Comparisons were done and a Common locale data repository  (with its
attendant XML formats) emerged. That's the C in CLDR. Seed data came from
IBM’s ICIR which dates many decades before 15897 (example
http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/
- 4th edition published in 1994.) 100 locales we contributed to glibc as
well.

Where there is opportunity for productive sync and merging with is glibc.
We have had some discussions, but more needs to be done- especially a lot
of tooling work. Currently many bug reports are duplicated between glibc
and cldr, a sort of manual synchronization. Help wanted here.

Steven

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 08:50:28 -0400, Tom Gewecke via Unicode wrote:
> 
> 
> > On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode  wrote:
> > 
> > What bothered me ... is that the registration of the French locale in CLDR 
> > is 
> > still surprisingly incomplete
> 
> Could you provide an example or two?
> 

What got me started is that "Characters | Category | Label | keycap" remained 
untranslated, i.e. its French translation was "keycap". 

A number of keyword translations are missing or wrong. I can tell that all 
actual contributors are working hard to fix the issues.
I can imagine that it’s by lack of time in front of the huge mass of data, or 
by feeling so alone (only three corporate contributors, 
no liaison or NGOs). No wonder if the official French translators are all 
sulking the job (reportedly, not me figuring out).

Marcel

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 13:06:18 +0200, Mark Davis ☕️ via Unicode wrote:
> 
> Where are you getting your "facts"? Among many unsubstantiated or ambiguous 
> claims in that very long sentence:
>
> > "French locale in CLDR is still surprisingly incomplete". 
>
> For each release, the data collected for the French locale is complete to the 
> bar we have set for Level=Modern.

What got me started is that before even I requested a submitter ID (and the 
reason why I’ve requested one), 
"Characters | Category | Label | keycap" remained untranslated, i.e. its French 
translation was "keycap".
When I proposed "cabochon", the present contributors kindly upvoted or proposed 
"touche" even before I 
launched a forum thread, and when I got aware, I changed my vote and posted the 
rationale on the forum, 
so the upvoting contributor kindly followed so that now we stay united for 
"touche", rather than "keycap".

Please note that I acknowledge everybody and don’t criticize anybody. It 
doesn’t require much imagination 
to figure out that when CLDR was set up, there were so few or even no French 
contributors that translating 
"keycap" either fell out of deadline or was overlooked or whatever, and later 
passed unnoticed. That is a 
tracer detecting that none of the people setting up the French translation of 
the Code Charts were ever on 
the CLDR project. Because if anybody of them had been active on CLDR, no 
English word would have been 
kept in use mistakenly for the French locale.

Beyond what everybody on this List is able to decrypt on his or her own, I’m 
not in a position to disclose 
any further personal information, for witness protection’s sake.

> What you may mean is that CLDR doesn't support a structure that you think it 
> should.
> For that, you have to make a compelling case that the structure you propose 
> is worth it, worth diverting people from other priorities.

Thank you, that is not a problem and may be resolved after filing a ticket, 
which would be done for a later release, given that 
top priority tasks require a potentially huge amount of work. First NBSP and 
NNBSP need to be added to the French charset (see
http://unicode.org/cldr/trac/ticket/11120
). Adding centuries to Date (with French short form "s.") is of interest 
for any locale, but irrelevant to everyday business practice.

>
> French contributors are not "prevented from cooperating". Where do you get 
> this from? Who do you mean?

Historic French contributors are ethically prevented from contributing to CLDR, 
because of a strong commitment to involve ISO/IEC, 
a notion that is very meaningful to Unicode. People relevant to projects for 
French locale do trace the borderline of applicability wider 
than do those people who are closerly tied to Unicode‐related projects.

>
> We have many French contribute data over time.

When finding the word "keycap" as a French translation of "keycap" in my copy 
of CLDR data at home, I wanted to know who contributed 
that data. I was told that when survey is open, I’ll see who is contributing. I 
won’t blame those who are helping resolve the issue now.

> Now, it works better when people engage under the umbrella of an 
> organization, but even there that doesn't have to be a company;
> we have liaison relationships with government agencies and NGOs.

That’s fine. But even as a guest I’m well received, and anyhow the point is to 
bring the arguments. 

My concern is that starting with a good translation from scratch is more 
efficient than attempting to correct the same error(s) 
across multiple instances via the survey tool, that seems to be designed to fix 
small errors rather than to redesign entire parts 
of the scheme. 

>
> There were not "many attempts" at a merger, and Unicode didn't "refuse" 
> anything. Who do you think "attempted", and when?

An influential person consistently campaigned for a merger of CLDR and ISO/IEC 
15897, but that never succeeded. It’s unlikely to be ignored.

>
> Albeit given the state of ISO/IEC 15897, there was nothing such a merger 
> would have contributed anyway.

I’ve took a glance at the data of ISO/IEC 15897 and cannot figure out that 
there is nothing to pick from. At least they won’t be disposed to 
sell you "keycap" as a French term or as being in any use in that target 
locale. And anyhow, the gesture would be appreciated as a piece 
of good diplomacy. Hopefully a lightweight proceeding could end up in that data 
being transferred to CLDR, and this being cited as sole 
normative reference in ISO/IEC 15897. As a result, everybody’s happy.

> BTW, your use of the term "refuse" might be a language issue. I don't 
> "refuse" to respond
> to the widow of a Nigerian Prince who wants to give me $1M. Since I don't 
> think it is worth my time,
> or am not willing to upfront the low, low fee of $10K, I might "ignore" the 
> email, or "not respond" to it.
> Or I might "decline" it with a no-thanks or not-interested response. But none 
> of that is to "refuse"

Re: The Unicode Standard and ISO

2018-06-08 Thread Tom Gewecke via Unicode



> On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode 
>  wrote:
> 
> What bothered me ... is that the registration of the French locale in CLDR is 
> still surprisingly incomplete

Could you provide an example or two?

Re: The Unicode Standard and ISO

2018-06-08 Thread Andrew West via Unicode

On 8 June 2018 at 13:01, Michael Everson via Unicode
 wrote:
>
> I wonder if Mark Davis will be quick to agree with me  when I say that 
> ISO/IEC 15897 has no use and should be withdrawn.

It was reviewed and confirmed in 2017, so the next systematic review
won't be until 2022. And as the standard is now under SC35, national
committees mirroring SC2 may well overlook (or be unable to provide
feedback to) the systematic review when it next comes around. I agree
that ISO/IEC 15897 has no use, and should be withdrawn.

Andrew

Re: The Unicode Standard and ISO

2018-06-08 Thread Michael Everson via Unicode

On 8 Jun 2018, at 04:32, Marcel Schneider via Unicode  
wrote:

> the registration of the French locale in CLDR is still surprisingly 
> incomplete despite the meritorious efforts made by the actual contributors

Nothing prevents people from working to complete the French locale in CLDR. 
Synchronization with an unused ISO standard is not necessary to do that. 

Michael Everson

Re: The Unicode Standard and ISO

2018-06-08 Thread Michael Everson via Unicode

On 7 Jun 2018, at 20:13, Marcel Schneider via Unicode  
wrote:

> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>> 
>> It would be great if mutual synchronization were considered to be of benefit.
>> Some of us in SC2 are not happy that the Unicode Consortium has published 
>> characters
>> which are still under Technical ballot. And this did not happen only once. 
> 
> I’m not happy catching up this thread out of time, the less as it ultimately 
> brings me where I’ve started 
> in 2014/2015: to the wrong character names that the ISO/IEC 10646 merger 
> infiltrated into Unicode.

Many things have more than one name. The only truly bad misnomers from that 
period was related to a mapping error, namely, in the treatment of Latvian 
characters which are called CEDILLA rather than COMMA BELOW.

> This is the very thing I did not vent in my first reply. From my point of 
> view, this misfortune would be 
> reason enough for Unicode not to seek further cooperation with ISO/IEC.

This is absolutely NOT what we want. What we want is for the two parties to 
remember that industrial concerns and public concerns work best together. 

> But I remember the many voices raising on this List to tell me that this is 
> all over and forgiven.

I think you are digging up an old grudge that nobody thinks about any longer. 

> Therefore I’m confident that the Consortium will have the mindfulness to 
> complete the ISO/IEC JTC 1 
> partnership by publicly assuming synchronization with ISO/IEC 14651,

There is no trouble with ISO/IEC 14651. 

> and achieving a fullscale merger with ISO/IEC 15897, after which the valid 
> data stay hosted entirely in CLDR, and ISO/IEC 15897 would be its ISO mirror. 

I wonder if Mark Davis will be quick to agree with me  when I say that ISO/IEC 
15897 has no use and should be withdrawn. 

Michael Everson

Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode

Mark

On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
>
> > Thank you for confirming. All witnesses concur to invalidate the
> > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > After being invented in its actual form, sorting was standardized
> > simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> > the latter including practice‐oriented extra features.
>
> The UCA contains features essential for respecting canonical
> equivalence.  ICU works hard to avoid the extra effort involved,
> apparently even going to the extreme of implicitly declaring that
> Vietnamese is not a human language.

A bit over the top, eh?

> (Some contractions are not
> supported by ICU!)

I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which
nicely outlines a proposal for dealing with a number of problems with
Vietnamese.

We clearly don't support every sorting feature that various dictionaries
and agencies come up with. Sometimes it is because we can't (yet) see a
good way to do it:

   1. it might be not determinant: many governmental standards or style
   sheets require "interesting" sorting, such as determining that "XI" is a
   roman numeral (not the president of China) and sorting as 11, or when "St."
   is meant to be Street *and* when meant to be Saint (St. Stephen's St.)
   2. the prospective cost in memory, code complexity, or performance, or
   the time necessary to figure out to do complex requirements, doesn't seem
   to warrant adding it at this point. Now, if you or others are interested
   in proposing specific patches to address certain issues, then you can
   propose that. Best to make a proposal (ticket) before doing the work,
   because if the solution is very intricate, even the time necessary to
   evaluate the patch can be too much to fit into the schedule. For that
   reason, it is best to break up such tickets into small, tractable pieces.

The synchronisation is manifest in the DUCET
> collation, which seems to make the effort to ensure that some canonical
> equivalent will sort the same way under ISO/IEC 14651.
>
> > Since then,
> > these two standards are kept in synchrony uninterruptedly.
>
> But the consortium has formally dropped the commitment to DUCET in
> CLDR.  Even when restricted to strings of assigned characters, the CLDR
> and ICU no longer make the effort to support the DUCET collation.
> Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
> collation, even when restricted to assigned characters.  Tailorings
> tend to have odd side effects; fortunately, they rarely if ever matter.
> CLDR root is a rewrite with modifications of DUCET; it has changes that
> are prohibited as 'tailorings'!
>

CLDR does make some tailorings to the DUCET to create its root collation,
notably adding special contractions of private use characters to allow for
tailoring support and indexes [
http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
]  plus the rearrangement of some characters (mostly punctuation and
symbols) to allow runtime parametric reordering of groups of characters (eg
to put numbers after letters) [
http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
].

   - If there are other changes that are not well documented, or if you
   think those features are causing problems in some way, please file a
   ticket.
   - If there is a particular change that you think is not conformant to
   UCA, please also file that.

> Richard.
>
>

Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode

Where are you getting your "facts"? Among many unsubstantiated or ambiguous
claims in that very long sentence:

   1. "French locale in CLDR is still surprisingly incomplete".
  1. For each release, the data collected for the French locale is
  complete to the bar we have set for Level=Modern.
  2. What you may mean is that CLDR doesn't support a structure that
  you think it should. For that, you have to make a compelling
case that the
  structure you propose is worth it, worth diverting people from other
  priorities.
   2. French contributors are not "prevented from cooperating". Where do
   you get this from? Who do you mean?
  1. We have many French contribute data over time. Now, it works
  better when people engage under the umbrella of an organization, but even
  there that doesn't have to be a company; we have liaison
relationships with
  government agencies and NGOs.
   3. There were not "many attempts" at a merger, and Unicode didn't
   "refuse" anything. Who do you think "attempted", and when?
   1. Albeit given the state of ISO/IEC 15897, there was nothing such a
  merger would have contributed anyway.
  2. BTW, your use of the term "refuse" might be a language issue. I
  don't "refuse" to respond to the widow of a Nigerian Prince who wants to
  give me $1M. Since I don't think it is worth my time, or am not
  willing to upfront the low, low fee of $10K, I might "ignore" the
  email, or "not respond" to it. Or I might "decline" it with a
no-thanks or
  not-interested response. But none of that is to "refuse" it.

Mark

On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> >
> > I cannot but fully agree with Mark and Michael.
> >
> > Sincerely
> >
>
> Thank you for confirming. All witnesses concur to invalidate the statement
> about
> uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in
> its
> actual form, sorting was standardized simultaneously in ISO/IEC 14651 and
> in
> Unicode Collation Algorithm, the latter including practice‐oriented extra
> features.
> Since then, these two standards are kept in synchrony uninterruptedly.
>
> Getting people to correct the overall response was not really my initial
> concern,
> however. What bothered me before I learned that Unicode refuses to
> cooperate
> with ISO/IEC JTC1 SC22 is that the registration of the French locale in
> CLDR is
> still surprisingly incomplete despite the meritorious efforts made by the
> actual
> contributors, and then after some investigation, that the main part of the
> potential
> French contributors are prevented from cooperating because Unicode refuses
> to
> cooperate with ISO/IEC on locale data while ISO/IEC 15897 predates CLDR,
> reportedly after many attempts made to merge both standards, remaining
> unsuccessful without any striking exposure or friendly agreement to avoid
> kind of
> an impression of unconcerned rebuff.
>
> Best regards,
>
> Marcel
>
>

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
Marcel Schneider via Unicode  wrote:

> Thank you for confirming. All witnesses concur to invalidate the
> statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> After being invented in its actual form, sorting was standardized
> simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> the latter including practice‐oriented extra features. 

The UCA contains features essential for respecting canonical
equivalence.  ICU works hard to avoid the extra effort involved,
apparently even going to the extreme of implicitly declaring that
Vietnamese is not a human language. (Some contractions are not
supported by ICU!)  The synchronisation is manifest in the DUCET
collation, which seems to make the effort to ensure that some canonical
equivalent will sort the same way under ISO/IEC 14651.

> Since then,
> these two standards are kept in synchrony uninterruptedly.

But the consortium has formally dropped the commitment to DUCET in
CLDR.  Even when restricted to strings of assigned characters, the CLDR
and ICU no longer make the effort to support the DUCET collation.
Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
collation, even when restricted to assigned characters.  Tailorings
tend to have odd side effects; fortunately, they rarely if ever matter.
CLDR root is a rewrite with modifications of DUCET; it has changes that
are prohibited as 'tailorings'!

Richard.

Re: The Unicode Standard and ISO

On Fri, 8 Jun 2018 00:43:04 +0200, Philippe Verdy via Unicode wrote:
[cited mail]
>
> The "normative names" are in fact normative only as a forward reference
> to the ISO/IEC repertoire becaus it insists that these names are essential 
> part
> of the stable encoding policy which was then integrated in the Unicode 
> stability rules,
> so that the normative reference remains stable as well). Beside this, Unicode 
> has other
> more useful properties. People don't care at all about these names.

Effectively we have learned to live even with those that are uselessly 
misleading and had 
been pushed through against better proposals made on Unicode side, particularly 
the 
wrong left/right attributes. Unicode have worked hard to palliate these 
misnomers by 
introducing the bidi_bracket (yes, no) and bidi_bracket_type (open, close) 
properties, 
and specifying in TUS that beside a few exceptions, LEFT and RIGHT in names of 
paired punctuation is to be read as OPENING and CLOSING, respectively.

> The character properties and the related algorithms that use them (and even
> the representative glyph even if it's not stabilized) are much more important
> (and the ISO/IEC 101646 does not do anything to solve the real encoding 
> issues,
> and needed properties for correct processing). Unicode is more based on 
> commonly
> used practices and allows experimetnation and progressive enhancing without 
> having
> to break the agreed ISO/EIC normative properties. The position of Unicode is 
> more
> pragmatic, and is much more open to lot of contibutors than the small ISO/IEC 
> subcomities
> with in fact very few active members, but it's still an interesting 
> counter-power that allows
> governments to choose where it is more useful to contribute and have 
> influence when
> the industry may have different needs and practices not foàllowing the 
> government
> recommendations adopted at ISO.

Now it becomes clear to me that this opportunity of governmental action is 
exactly what 
could be useful when it’s up to fix the textual appearance of national user 
interfaces, and 
that is exactly why not federating communities around CLDR, and not attempting 
to get 
efforts converge, is so counter‐productive.

Thanks for getting this point out.

Best regards,

Marcel

RE: The Unicode Standard and ISO

On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> 
> I cannot but fully agree with Mark and Michael.
> 
> Sincerely
> 

Thank you for confirming. All witnesses concur to invalidate the statement 
about 
uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in its 
actual form, sorting was standardized simultaneously in ISO/IEC 14651 and in 
Unicode Collation Algorithm, the latter including practice‐oriented extra 
features. 
Since then, these two standards are kept in synchrony uninterruptedly.

Getting people to correct the overall response was not really my initial 
concern, 
however. What bothered me before I learned that Unicode refuses to cooperate 
with ISO/IEC JTC1 SC22 is that the registration of the French locale in CLDR is 
still surprisingly incomplete despite the meritorious efforts made by the 
actual 
contributors, and then after some investigation, that the main part of the 
potential 
French contributors are prevented from cooperating because Unicode refuses to 
cooperate with ISO/IEC on locale data while ISO/IEC 15897 predates CLDR, 
reportedly after many attempts made to merge both standards, remaining
unsuccessful without any striking exposure or friendly agreement to avoid kind 
of 
an impression of unconcerned rebuff.

Best regards,

Marcel

Re: The Unicode Standard and ISO

2018-06-07 Thread Philippe Verdy via Unicode

2018-06-07 21:13 GMT+02:00 Marcel Schneider via Unicode :

> On Thu, 17 May 2018 22:26:15 +, Peter Constable via Unicode wrote:
> […]
> > Hence, from an ISO perspective, ISO 10646 is the only standard for which
> on-going
> > synchronization with Unicode is needed or relevant.
>
> This point of view is fueled by the Unicode Standard being traditionally
> thought of as a mere character set,
> regardless of all efforts—lastly by first responder Asmus Freytag
> himself—to widen the conception.
>
> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
> >
> > It would be great if mutual synchronization were considered to be of
> benefit.
> > Some of us in SC2 are not happy that the Unicode Consortium has
> published characters
> > which are still under Technical ballot. And this did not happen only
> once.
>
> I’m not happy catching up this thread out of time, the less as it
> ultimately brings me where I’ve started
> in 2014/2015: to the wrong character names that the ISO/IEC 10646 merger
> infiltrated into Unicode.
> This is the very thing I did not vent in my first reply. From my point of
> view, this misfortune would be
> reason enough for Unicode not to seek further cooperation with ISO/IEC.
>

The "normative names" are in fact normative only as a forward reference to
the ISO/IEC repertoire becaus it insists that these names are essential
part of the stable encoding policy which was then integrated in the Unicode
stability rules, so that the normative reference remains stable as well).
Beside this, Unicode has other more useful properties. People don't care at
all about these names. The character properties and the related algorithms
that use them (and even the representative glyph even if it's not
stabilized) are much more important (and the ISO/IEC 101646 does not do
anything to solve the real encoding issues, and needed properties for
correct processing). Unicode is more based on commonly used practices and
allows experimetnation and progressive enhancing without having to break
the agreed ISO/EIC normative properties. The position of Unicode is more
pragmatic, and is much more open to lot of contibutors than the small
ISO/IEC subcomities with in fact very few active members, but it's still an
interesting counter-power that allows governments to choose where it is
more useful to contribute and have influence when the industry may have
different needs and practices not foàllowing the government recommendations
adopted at ISO.

RE: The Unicode Standard and ISO

2018-06-07 Thread via Unicode

I cannot but fully agree with Mark and Michael.

Sincerely

Erkki I. Kolehmainen
Mannerheimintie 75 B 37, 00270 Helsinki, Finland
Mob: +358 400 825 943 

-Alkuperäinen viesti-
Lähettäjä: Unicode  Puolesta Michael Everson via 
Unicode
Lähetetty: torstai 7. kesäkuuta 2018 16.29
Vastaanottaja: unicode Unicode Discussion 
Aihe: Re: The Unicode Standard and ISO

On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode  wrote:
> 
> A few facts. 
> 
>> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above statement is 
> inaccurate.

Mark is right. 

>> > ... For another part it [sync with ISO/IEC 15897] failed because the 
>> > Consortium refused to cooperate, despite of repeated proposals for a 
>> > merger of both instances.
> 
> I recall no serious proposals for that. 

Nor do I.

> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
> considerable costs of maintaining synchrony. Completely inadequate structure 
> for modern system requirement, no particular industry support, and scant 
> content: see Wikipedia for "The registry has not been updated since December 
> 2001”.)

Mark is right.

Michael Everson

Re: The Unicode Standard and ISO

On Thu, 17 May 2018 22:26:15 +, Peter Constable via Unicode wrote:
[…]
> Hence, from an ISO perspective, ISO 10646 is the only standard for which 
> on-going
> synchronization with Unicode is needed or relevant. 

This point of view is fueled by the Unicode Standard being traditionally 
thought of as a mere character set, 
regardless of all efforts—lastly by first responder Asmus Freytag himself—to 
widen the conception.

On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>
> It would be great if mutual synchronization were considered to be of benefit.
> Some of us in SC2 are not happy that the Unicode Consortium has published 
> characters
> which are still under Technical ballot. And this did not happen only once. 

I’m not happy catching up this thread out of time, the less as it ultimately 
brings me where I’ve started 
in 2014/2015: to the wrong character names that the ISO/IEC 10646 merger 
infiltrated into Unicode.
This is the very thing I did not vent in my first reply. From my point of view, 
this misfortune would be 
reason enough for Unicode not to seek further cooperation with ISO/IEC.

But I remember the many voices raising on this List to tell me that this is all 
over and forgiven.
Therefore I’m confident that the Consortium will have the mindfulness to 
complete the ISO/IEC JTC 1 
partnership by publicly assuming synchronization with ISO/IEC 14651, and 
achieving a fullscale merger 
with ISO/IEC 15897, after which the valid data stay hosted entirely in CLDR, 
and ISO/IEC 15897 would 
be its ISO mirror. 

That is a matter of smart diplomacy, that Unicode may prove again to be great 
in.

Please consider making this move.

Thanks,

Marcel

Re: The Unicode Standard and ISO

On Thu, 7 Jun 2018 15:20:29 +0200, Mark Davis ☕️ via Unicode wrote:
> 
> A few facts. 
>
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the
> synchronization level in more detail, but the above statement is inaccurate.
>
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to
> > cooperate, despite of repeated proposals for a merger of both instances.
> 
> I recall no serious proposals for that. 
> 
> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought
> no value to the table. Certainly nothing to outweigh the considerable costs 
> of maintaining synchrony.
> Completely inadequate structure for modern system requirement, no particular 
> industry support, and
> scant content: see Wikipedia for "The registry has not been updated since 
> December 2001".)

Thank you for correcting as of the Unicode ISO/IEC 14651 synchrony; indeed 
while on

http://www.unicode.org/reports/tr10/#Synch_ISO14651

we can read that “This relationship between the two standards is similar to 
that maintained between
the Unicode Standard and ISO/IEC 10646[,]” confusingly there seems to be no 
related FAQ. Even more 
confusingly, a straightforward question like “I was wondering which ISO 
standards other than ISO 10646 
specify the same things as the Unicode Standard” remains ultimately unanswered. 

The reason might be that the “and of those, which ones are actively kept in 
sync” part is really best 
answered by “none.” In fact, while UCA is synched with ISO/IEC 14651, the 
reverse statement is 
reportedly false. Hence, UCA would be what is called an implementation of 
ISO/IEC 14651.

Nevertheless, UAX #10 refers to “The synchronized version of ISO/IEC 14651[,]” 
and mentions a 
“common tool[.]” 

Hence one simple question: Why does the fact that the Unicode-ISO synchrony 
encompasses *two* 
standards remain untold in the first places?

As of ISO/IEC 15897, it would certainly be a piece of good diplomacy that 
Unicode pick the usable 
data in the existing set, and then ISO/IEC 15897 will be in a position to cite 
CLDR as a normative 
reference so that all potential contributors are redirected and may feel free 
to contribute to CLDR.

And it would be nice that Unicode don’t forget to order an additional FAQ about 
the topic, please.

Thanks,

Marcel

Re: The Unicode Standard and ISO

2018-06-07 Thread Michael Everson via Unicode

On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode  wrote:
> 
> A few facts. 
> 
>> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above statement is 
> inaccurate.

Mark is right. 

>> > ... For another part it [sync with ISO/IEC 15897] failed because the 
>> > Consortium refused to cooperate, despite of repeated proposals for a 
>> > merger of both instances.
> 
> I recall no serious proposals for that. 

Nor do I.

> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
> considerable costs of maintaining synchrony. Completely inadequate structure 
> for modern system requirement, no particular industry support, and scant 
> content: see Wikipedia for "The registry has not been updated since December 
> 2001”.)

Mark is right.

Michael Everson

Re: The Unicode Standard and ISO

2018-06-07 Thread Mark Davis ☕️ via Unicode

A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could
speak to the synchronization level in more detail, but the above statement
is inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the
Consortium refused to cooperate, despite of
repeated proposals for a merger of both instances.

I recall no serious proposals for that.

(And in any event — very unlike the synchrony with 10646 and 14651 — ISO 15897
brought no value to the table. Certainly nothing to outweigh the
considerable costs of maintaining synchrony. Completely inadequate
structure for modern system requirement, no particular industry support,
and scant content: see Wikipedia for "The registry has not been updated
since December 2001".)

Mark

Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> >
> > On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > > Hello,
> > >
> > > There are several mentions of synchronization with related standards in
> > > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > > https://www.unicode.org/faq/unicode_iso.html. However, all such
> mentions
> > > never mention anything other than ISO 10646.
> >
> > Because that is the standard for which there is an explicit
> understanding by all involved
> > relating to synchronization. There have been occasionally some
> challenging differences
> > in the process and procedures, but generally the synchronization is
> being maintained,
> > something that's helped by the fact that so many people are active in
> both arenas.
>
> Perhaps the cause-effect relationship is somewhat unclear. I think that
> many people being
> active in both arenas is helped by the fact that there is a strong will to
> maintain synching.
>
> If there were similar policies notably for ISO/IEC 14651 (collation) and
> ISO/IEC 15897
> (locale data), ISO/IEC 10646 would be far from standing alone in the field
> of
> Unicode-ISO/IEC cooperation.
>
> >
> > There are really no other standards where the same is true to the same
> extent.
> > >
> > > I was wondering which ISO standards other than ISO 10646 specify the
> > > same things as the Unicode Standard, and of those, which ones are
> > > actively kept in sync. This would be of importance for standardization
> > > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > > ISO standards is generally preferred in ISO standards.
> > >
> > One of the areas the Unicode Standard differs from ISO 10646 is that its
> conception
> > of a character's identity implicitly contains that character's
> properties - and those are
> > standardized as well and alongside of just name and serial number.
>
> This is probably why, to date, ISO/IEC 10646 features character properties
> by including
> normative references to the Unicode Standard, Standard Annexes, and the
> UCD.
> Bidi-mirroring e.g. is part of ISO/IEC 10646 that specifies in clause 15.1:
>
> “[…] The list of these characters is determined by having the
> ‘Bidi_Mirrored’ property
> set to ‘Y’ in the Unicode Standard. These values shall be determined
> according to
> the Unicode Standard Bidi Mirrored property (see Clause 2).”
>
> >
> > Many of these properties have associated with them algorithms, e.g. the
> bidi algorithm,
> > that are an essential element of data interchange: if you don't know
> which order in
> > the backing store is expected by the recipient to produce a certain
> display order, you
> > cannot correctly prepare your data.
> >
> > There is one area where standardization in ISO relates to work in
> Unicode that I can
> > think of, and that is sorting.
>
> Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the
> bibliography).
> The reverse relationship is irrelevant and would be unfair, given that the
> Consortium
> refused till now to synchronize UCA and ISO/IEC 14651.
>
> Here is a need for action.
>
> > However, sorting, beyond the underlying framework,
> > ultimately relates to languages, and language-specific data is now
> housed in CLDR.
> >
> > Early attempts by ISO to standardize a similar framework for locale data
> failed, in
> > part because the framework alone isn't the interesting challenge for a
> repository,
> > instead it is the collection, vetting and management of the data.
>
> For another part it failed because the Consortium refused to cooperate,
> despite of
> repeated proposals for a merger of both instances.
>
> >
> > The reality is that the ISO model and its organizational structures are
> not well suited
> > to the needs of many important area where some form of standardization
> is needed.
> > That's why we have organization like IETF, W3C, Unicode etc..
> >
> > Duplicating all or even part of their effort inside ISO really serves
> nobody's

Re: The Unicode Standard and ISO