Re: The Unicode Standard and ISO [localizable sentences]

2018-06-15 Thread William_J_G Overington via Unicode
> The topic of localizable sentences is now closed on this mail list.
> Please take that topic elsewhere.
> Thank you.

May I please mention, with permission, that there is now a thread to discuss 
the issue of translations and their context that was mentioned?

https://community.serif.com/discussion/112261/a-discussion-about-translations-and-their-context-localizable-sentences-research-project-related

The thread is in the lounge section of the support forum of Serif, the English 
software company that produced the program that I use to produce PDF (Portable 
Document Format) documents.

William Overington

Friday 15 June 2018



Re: The Unicode Standard and ISO

2018-06-13 Thread Marcel Schneider via Unicode
On Tue, 12 Jun 2018 19:49:10 +0200, Mark Davis ☕️ via Unicode wrote:
[…]
> People interested in this topic should 
> (a) start up their own project somewhere else,
> (b) take discussion of it off this list,
> (c) never bring it up again on this list.

Thank you for letting us know. I apologize for my e-mailing. For a variety of 
reasons I didn’t respond immediately, though of course I fully agreed at once; 
I had mainly wondered why I got no feedback when I had previously ended a 
similar thread, but that no longer matters.
No problem: as far as it is up to me, this topic will never be raised again 
here or elsewhere.

Sorry again.

Best regards,

Marcel



Re: The Unicode Standard and ISO [localizable sentences]

2018-06-12 Thread Sarasvati via Unicode
The topic of localizable sentences is now closed on this mail list.
Please take that topic elsewhere.
Thank you.


On 6/12/2018 10:49 AM, Mark Davis ☕️ via Unicode wrote:

> That is often a viable approach. But proponents shouldn't get the wrong
> impression. I think anything resembling the "localized sentences" /
> "international message components" has zero chance of being adopted by
> Unicode (including the encoding, CLDR, anything). It is a waste of many
> people's time to discuss it further on this list.

> Why? As discussed many times on this list, it would take a major effort,
> is not scoped properly (the translation of messages depends highly on
> context, including specific products), and would not meet the needs of
> practically anyone.

> People interested in this topic should  
> (a) start up their own project somewhere else, 
> (b) take discussion of it off this list, 
> (c) never bring it up again on this list.



Re: The Unicode Standard and ISO

2018-06-12 Thread Steven R. Loomis via Unicode
On Mon, Jun 11, 2018 at 8:32 AM, William_J_G Overington <
wjgo_10...@btinternet.com> wrote:

> Steven R. Loomis wrote:
>
> >Marcel,
> > The idea is not necessarily without merit. However, CLDR does not
> usually expand scope just because of a suggestion.
>  I usually recommend creating a new project first - gathering data,
> looking at and talking to projects to ascertain the usefulness of common
> messages.. one of the barriers to adding new content for CLDR is not just
> the design, but collecting initial data. When emoji or sub-territory names
> were added, many languages were included before it was added to CLDR.
>
> Well, maybe usually, but perhaps not this time?


Especially this time.
To Mark's later point: Start a separate project. Don't assume it will ever
merge with CLDR. If it succeeds, great.


Re: The Unicode Standard and ISO

2018-06-12 Thread Mark Davis ☕️ via Unicode
Steven wrote:

>  I usually recommend creating a new project first...

That is often a viable approach. But proponents shouldn't get the wrong
impression. I think anything resembling the "localized sentences" /
"international message components" has zero chance of being adopted by
Unicode (including the encoding, CLDR, anything). It is a waste of many
people's time to discuss it further on this list.

Why? As discussed many times on this list, it would take a major effort, is
not scoped properly (the translation of messages depends highly on context,
including specific products), and would not meet the needs of practically
anyone.

People interested in this topic should
(a) start up their own project somewhere else,
(b) take discussion of it off this list,
(c) never bring it up again on this list.


Mark

On Tue, Jun 12, 2018 at 4:53 PM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

>
> William,
>
> On 12/06/18 12:26, William_J_G Overington wrote:
> >
> > Hi Marcel
> >
> > > I don’t fully disagree with Asmus, as I suggested to make available
> localizable (and effectively localized) libraries of message components,
> rather than of entire messages.
> >
> > Could you possibly give some examples of the message components to which
> you refer please?
> >
>
> Likewise I’d be interested in asking Jonathan Rosenne for an example or
> two of automated translation from English to bidi languages with data
> embedded,
> as on Mon, 11 Jun 2018 15:42:38 +, Jonathan Rosenne via Unicode wrote:
> […]
> > > > One has to see it to believe what happens to messages translated
> mechanically from English to bidi languages when data is embedded in the
> text.
>
> But both would require launching a new thread.
>
> On reflection, I’m afraid that most subscribers wouldn’t be interested,
> so we’d have to move off-list.
>
> One alternative I can think of is to use one of the CLDR mailing lists. I
> subscribed to CLDR-users when I was directed to move some technical
> discussion about keyboard layouts there from the public Unicode list.
>
> But since international message components are not yet part of CLDR,
> we’d need to ask for extra permission to do so.
>
> An additional drawback of launching a technical discussion right now is
> that significant parts of CLDR data are not yet correctly localized, so
> there is another set of priorities ahead of the July 11 deadline. I guess
> that vendors wouldn’t be glad to see us gathering data for new structures
> while level=Modern isn’t complete.
>
> In the meantime, you are welcome to contribute and to encourage others
> to do the same.
>
> Best regards,
>
> Marcel
>
>


Re: The Unicode Standard and ISO

2018-06-12 Thread Steven R. Loomis via Unicode
> ISO 15924 is an ISO standard. Aspects of its content may be mirrored in
> other places, but “moving its content” to CLDR makes no sense.

Fully agreed.

For what it's worth, I reopened a bug of Roozbeh's
(https://unicode.org/cldr/trac/ticket/827?#comment:9) to make sure the ISO
15924 French content gets properly mirrored into CLDR. It looks like there
is a French-specific bug there, which may be what you are seeing, Marcel.


On Tue, Jun 12, 2018 at 8:57 AM, Michael Everson via Unicode <
unicode@unicode.org> wrote:

> All right, if you want a clear explanation.
>
> Yes, I think the ISO 8859-4 character names for the Latvian letters were
> mistaken. Yes, I think that mapping them to decompositions with CEDILLA
> rather than COMMA BELOW was a mistake. Evidently some felt that the
> normative mapping was important. This does not mean that SC2 “failed to do
> its part” and it did not cause a lack of desire for cooperation, and it
> bloody well did not “damage the reputation of the whole ISO/IEC”.
>
> As to ISO 15924, it was developed bilingually, and there was consensus on
> the names that are there. Last year you suggested a massive number of name
> changes to the French translation of ISO/IEC 10646, and I criticized you
> for foregoing stability for your own preferences. When it came to the names
> in 15924, I told you that I do not trust your judgement, and that I would
> consider revisions to the French names when you came back with consensus on
> those changes with experts Alain LaBonté, Patrick Andries, Denis Jacquerye,
> and Marc Lodewijck. As I have not heard from them, I conclude that no such
> consensus exists.
>
> ISO 15924 is an ISO standard. Aspects of its content may be mirrored in
> other places, but “moving its content” to CLDR makes no sense.
>
> Michael Everson
>
> > On 12 Jun 2018, at 16:20, Marcel Schneider via Unicode <
> unicode@unicode.org> wrote:
> > On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> >>
> >> Marcel,
> >> You have put words into my mouth. Please don’t. Your description of
> what I said is NOT accurate.
> >>
> >>> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
> >>> And in this thread I wanted to demonstrate that by focusing on the
> wrong priorities, i.e. legacy character names instead of the practicability
> of on-going encoding and the accurateness of specified decompositions—so
> that in some instances cedilla was used instead of comma below, Michael
> pointed out—, ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its
> mission—and thus didn’t inspire a desire of extensive cooperation (and
> damaged the reputation of the whole ISO/IEC).
> >
> > Michael, I’d better quote your actual e-mail:
> >
> > On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> > […]
> >> Many things have more than one name. The only truly bad misnomers from
that period were related to a mapping error,
> >> namely, in the treatment of Latvian characters which are called CEDILLA
> rather than COMMA BELOW.
> >
> > Now I fail to understand why this mustn’t be reworded to “the
> accurateness of specified decompositions—so that in some instances cedilla
> was used instead of comma below[.]” If any correction can be made, I’d be
> eager to take note. Thanks for correcting.
> >
> > Now let’s append the e-mail that I was about to send:
> >
> > Another ISO Standard that needs to be mentioned in this thread is ISO
> 15924 (script codes; not ISO/IEC). It has a particular status in that
> Unicode is the Registration Authority.
> >
> > I wonder whether people agree that it has a French version. Actually it
> does have a French version, but Michael Everson (Registrar) revealed on
> this List multiple issues with synching French script names in ISO 15924-fr
> and in Code Charts translations.
> >
> > Shouldn’t this content be moved to CLDR? At least with respect to
> localized script names.
>
>
>


Re: The Unicode Standard and ISO

2018-06-12 Thread Asmus Freytag via Unicode

On 6/12/2018 7:58 AM, Michael Everson via Unicode wrote:

> Marcel,
>
> You have put words into my mouth. Please don’t. Your description of what I said is NOT accurate.
>
> > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
> >
> > And in this thread I wanted to demonstrate that by focusing on the wrong priorities, i.e. legacy character names instead of the practicability of on-going encoding and the accurateness of specified decompositions—so that in some instances cedilla was used instead of comma below, Michael pointed out—, ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission—and thus didn’t inspire a desire of extensive cooperation (and damaged the reputation of the whole ISO/IEC).

The final conclusion isn't backed by the evidence.

This kind of fault-finding needs to stop - it's unproductive.

A./


Re: The Unicode Standard and ISO

2018-06-12 Thread Steven R. Loomis via Unicode
CLDR already has localized script names. The English is taken from ISO
15924. https://cldr-ref.unicode.org/cldr-apps/v#/fr/Scripts/

On Tue, Jun 12, 2018 at 8:20 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> >
> > Marcel,
> >
> > You have put words into my mouth. Please don’t. Your description of what
> I said is NOT accurate.
> >
> > > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
> > >
> > > And in this thread I wanted to demonstrate that by focusing on the
> wrong priorities, i.e. legacy character names instead of
> > > the practicability of on-going encoding and the accurateness of
> specified decompositions—so that in some instances cedilla
> > > was used instead of comma below, Michael pointed out—, ISO/IEC JTC1
> SC2/WG2 failed to do its part and missed its mission—
> > > and thus didn’t inspire a desire of extensive cooperation (and damaged
> the reputation of the whole ISO/IEC).
>
> Michael, I’d better quote your actual e-mail:
>
> On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> […]
> > Many things have more than one name. The only truly bad misnomers from
that period were related to a mapping error,
> > namely, in the treatment of Latvian characters which are called CEDILLA
> rather than COMMA BELOW.
>
> Now I fail to understand why this mustn’t be reworded to “the accurateness
> of specified decompositions—so that in some instances
> cedilla was used instead of comma below[.]”
> If any correction can be made, I’d be eager to take note.
> Thanks for correcting.
>
> Now let’s append the e-mail that I was about to send:
>
> Another ISO Standard that needs to be mentioned in this thread is ISO
> 15924 (script codes; not ISO/IEC).
> It has a particular status in that Unicode is the Registration Authority.
>
> I wonder whether people agree that it has a French version. Actually it
> does have a French version, but
> Michael Everson (Registrar) revealed on this List multiple issues with
> synching French script names in
> ISO 15924-fr and in Code Charts translations.
>
> Shouldn’t this content be moved to CLDR? At least with respect to
> localized script names.
>


Re: The Unicode Standard and ISO

2018-06-12 Thread Michael Everson via Unicode
All right, if you want a clear explanation.

Yes, I think the ISO 8859-4 character names for the Latvian letters were 
mistaken. Yes, I think that mapping them to decompositions with CEDILLA rather 
than COMMA BELOW was a mistake. Evidently some felt that the normative mapping 
was important. This does not mean that SC2 “failed to do its part” and it did 
not cause a lack of desire for cooperation, and it bloody well did not “damage 
the reputation of the whole ISO/IEC”. 

As to ISO 15924, it was developed bilingually, and there was consensus on the 
names that are there. Last year you suggested a massive number of name changes 
to the French translation of ISO/IEC 10646, and I criticized you for foregoing 
stability for your own preferences. When it came to the names in 15924, I told 
you that I do not trust your judgement, and that I would consider revisions to 
the French names when you came back with consensus on those changes with 
experts Alain LaBonté, Patrick Andries, Denis Jacquerye, and Marc Lodewijck. As 
I have not heard from them, I conclude that no such consensus exists. 

ISO 15924 is an ISO standard. Aspects of its content may be mirrored in other 
places, but “moving its content” to CLDR makes no sense. 

Michael Everson

> On 12 Jun 2018, at 16:20, Marcel Schneider via Unicode  
> wrote:
> On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
>> 
>> Marcel,
>> You have put words into my mouth. Please don’t. Your description of what I 
>> said is NOT accurate. 
>> 
>>> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
>>> And in this thread I wanted to demonstrate that by focusing on the wrong 
>>> priorities, i.e. legacy character names instead of the practicability of 
>>> on-going encoding and the accurateness of specified decompositions—so that 
>>> in some instances cedilla was used instead of comma below, Michael pointed 
>>> out—, ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission—and 
>>> thus didn’t inspire a desire of extensive cooperation (and damaged the 
>>> reputation of the whole ISO/IEC).
> 
> Michael, I’d better quote your actual e-mail:
> 
> On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
> […]
>> Many things have more than one name. The only truly bad misnomers from that 
>> period were related to a mapping error,
>> namely, in the treatment of Latvian characters which are called CEDILLA 
>> rather than COMMA BELOW. 
> 
> Now I fail to understand why this mustn’t be reworded to “the accurateness of 
> specified decompositions—so that in some instances cedilla was used instead 
> of comma below[.]” If any correction can be made, I’d be eager to take note. 
> Thanks for correcting.
> 
> Now let’s append the e-mail that I was about to send:
> 
> Another ISO Standard that needs to be mentioned in this thread is ISO 15924 
> (script codes; not ISO/IEC). It has a particular status in that Unicode is 
> the Registration Authority. 
> 
> I wonder whether people agree that it has a French version. Actually it does 
> have a French version, but Michael Everson (Registrar) revealed on this List 
> multiple issues with synching French script names in ISO 15924-fr and in Code 
> Charts translations.
> 
> Shouldn’t this content be moved to CLDR? At least with respect to localized 
> script names.





Re: The Unicode Standard and ISO

2018-06-12 Thread Marcel Schneider via Unicode
On Tue, 12 Jun 2018 15:58:09 +0100, Michael Everson via Unicode wrote:
> 
> Marcel,
> 
> You have put words into my mouth. Please don’t. Your description of what I 
> said is NOT accurate. 
> 
> > On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  wrote:
> > 
> > And in this thread I wanted to demonstrate that by focusing on the wrong 
> > priorities, i.e. legacy character names instead of
> > the practicability of on-going encoding and the accurateness of specified 
> > decompositions—so that in some instances cedilla
> > was used instead of comma below, Michael pointed out—, ISO/IEC JTC1 SC2/WG2 
> > failed to do its part and missed its mission—
> > and thus didn’t inspire a desire of extensive cooperation (and damaged the 
> > reputation of the whole ISO/IEC).

Michael, I’d better quote your actual e-mail:

On Fri, 8 Jun 2018 13:01:48 +0100, Michael Everson via Unicode wrote:
[…]
> Many things have more than one name. The only truly bad misnomers from that 
> period were related to a mapping error,
> namely, in the treatment of Latvian characters which are called CEDILLA 
> rather than COMMA BELOW. 

Now I fail to understand why this mustn’t be reworded to “the accurateness of 
specified decompositions—so that in some instances 
cedilla was used instead of comma below[.]”
If any correction can be made, I’d be eager to take note.
Thanks for correcting.

Now let’s append the e-mail that I was about to send:

Another ISO Standard that needs to be mentioned in this thread is ISO 15924 
(script codes; not ISO/IEC).
It has a particular status in that Unicode is the Registration Authority. 

I wonder whether people agree that it has a French version. Actually it does 
have a French version, but 
Michael Everson (Registrar) revealed on this List multiple issues with synching 
French script names in 
ISO 15924-fr and in Code Charts translations.

Shouldn’t this content be moved to CLDR? At least with respect to localized 
script names.



Re: The Unicode Standard and ISO

2018-06-12 Thread Michael Everson via Unicode
Marcel,

You have put words into my mouth. Please don’t. Your description of what I said 
is NOT accurate. 

> On 12 Jun 2018, at 03:53, Marcel Schneider via Unicode  
> wrote:
> 
> And in this thread I wanted to demonstrate that by focusing on the wrong 
> priorities, i.e. legacy character names instead of the practicability of 
> on-going encoding and the accurateness of specified decompositions—so that in 
> some instances cedilla was used instead of comma below, Michael pointed out—, 
> ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission—and thus 
> didn’t inspire a desire of extensive cooperation (and damaged the reputation 
> of the whole ISO/IEC).




Re: The Unicode Standard and ISO

2018-06-12 Thread Marcel Schneider via Unicode


William,

On 12/06/18 12:26, William_J_G Overington wrote:
> 
> Hi Marcel
> 
> > I don’t fully disagree with Asmus, as I suggested to make available 
> > localizable (and effectively localized) libraries of message components, 
> > rather than of entire messages.
> 
> Could you possibly give some examples of the message components to which you 
> refer please?
> 

Likewise I’d be interested in asking Jonathan Rosenne for an example or two of 
automated translation from English to bidi languages with data embedded, 
as on Mon, 11 Jun 2018 15:42:38 +, Jonathan Rosenne via Unicode wrote:
[…]
> > > One has to see it to believe what happens to messages translated 
> > > mechanically from English to bidi languages when data is embedded in the 
> > > text. 

But both would require launching a new thread. 

On reflection, I’m afraid that most subscribers wouldn’t be interested, so 
we’d have to move off-list. 

One alternative I can think of is to use one of the CLDR mailing lists. I 
subscribed to CLDR-users when I was directed to move some technical 
discussion about keyboard layouts there from the public Unicode list.

But since international message components are not yet part of CLDR, we’d 
need to ask for extra permission to do so.

An additional drawback of launching a technical discussion right now is that 
significant parts of CLDR data are not yet correctly localized, so there is 
another set of priorities ahead of the July 11 deadline. I guess that vendors 
wouldn’t be glad to see us gathering data for new structures while 
level=Modern isn’t complete.

In the meantime, you are welcome to contribute and to encourage others to do 
the same.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-12 Thread William_J_G Overington via Unicode
Hi Marcel

> I don’t fully disagree with Asmus, as I suggested to make available 
> localizable (and effectively localized) libraries of message components, 
> rather than of entire messages.

Could you possibly give some examples of the message components to which you 
refer please?

Asmus wrote:

> A middle ground is a shared terminology database that allows translators 
> working on different products to arrive at the same translation for the same 
> things. Translators already know how to use such databases in their work 
> flow, and integrating a shared one with a product-specific one is much easier 
> than trying to deal with a set of random error messages.

I am not a linguist. I am interested in languages but my knowledge of languages 
is little more than that of general education, though I have written a song in 
French.

http://www.users.globalnet.co.uk/~ngo/une_chanson.pdf

So when Asmus wrote "Translators already know how to use such databases in 
their work flow", I should note that I do not know how to do that myself.

> The challenge as I see it is to get them translated to all locales.

Well, yes, that is a big challenge.

It depends whether people want to get it done.

In England, with its changeable weather, part of the culture is to talk about 
the weather. For example, at a bus stop talking about the weather with other 
people: it is sociable without being intrusive or controversial. Alas it did 
not occur to me that that might seem strange to some people who are not from 
England.

http://www.english-at-home.com/speaking/talking-about-the-weather/

http://www.bbc.com/future/story/20151214-why-do-brits-talk-about-the-weather-so-much

I remember when I wrote about localizable sentences in this mailing list in 
mid-April 2009, using sentences about the weather, I hoped, in hindsight rather 
naively, that people on the mailing list would be interested and that 
translations into many languages would be posted and then things would get 
going.

In the event, only one person, Magnus Bodin, provided translations. Magnus 
provided translations into Swedish and also provided a translation for an 
additional sentence as well. I knew no Swedish myself. These translations have 
been extremely helpful in my research project as they demonstrate communication 
through the language barrier using encoded localizable sentences.

Yesterday I provided three example error message sentences.

https://www.unicode.org/mail-arch/unicode-ml/y2018-m06/0088.html

Please consider one of them. If someone enters a letter of the alphabet into a 
currency field, the application program could output a code number, say 
::4842357:;. The message would then be displayed localized into a language by 
decoding it with a sentence.dat UTF-16 text file for that language: the file 
includes a line that starts with ::4842357:;| followed by the localization 
into that particular language, which can be any language that can be displayed 
using Unicode.

For English, the line in the sentence.dat file would be as follows.

::4842357:;|Data entry for the currency field must be either a whole positive 
number or a positive number to exactly two decimal places.
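As a sketch of how such a decoding step might work in practice: the code below parses lines in the ::number:;|text format described above and looks sentences up by code number. The parsing logic itself is only an illustration, not part of the proposal; only the line format and the example code 4842357 are taken from the message.

```python
def load_sentences(lines):
    """Parse sentence.dat lines of the form '::<code>:;|<localized text>'
    into a lookup table mapping code numbers to localized sentences."""
    table = {}
    for line in lines:
        line = line.strip()
        if line.startswith("::") and ":;|" in line:
            code, text = line[2:].split(":;|", 1)
            table[int(code)] = text
    return table

def localize(code, table):
    """Return the localized sentence for a code, or echo the raw code
    marker when the language file has no entry for it."""
    return table.get(code, "::%d:;" % code)

# The English line given above, as it would appear in sentence.dat:
english = [
    "::4842357:;|Data entry for the currency field must be either a whole "
    "positive number or a positive number to exactly two decimal places.",
]
table = load_sentences(english)
```

A sentence.dat file for another language would carry the same code numbers with different text, so the application emits only the number and the display side chooses the file.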

It would be great if some bilingual readers of this mailing list were to post a 
translation of the above line of text into another language.

In my research I am using an integral sign as a base character and circled 
digit characters.

If possible, a character such as U+FFF7 could be encoded to be the base 
character as that would provide a unique unambiguous link to star space from 
Unicode plain text. However whether that happens at some future time will 
depend upon there being sufficient interest at that future time in using 
localizable sentences for communication through the language barrier.

William Overington

Tuesday 12 June 2018




Re: The Unicode Standard and ISO

2018-06-11 Thread Marcel Schneider via Unicode
On Mon, 11 Jun 2018 16:32:45 +0100 (BST), William_J_G Overington via Unicode 
wrote:
[…]
> Asmus Freytag wrote:
> 
> > If you tried to standardize all error messages even in one language you 
> > would never arrive at something that would be universally useful.
> 
> Well that is a big "If". One cannot standardize all pictures as emoji, but 
> emoji still get encoded, some every year now.
> 
> I first learned to program back in the 1960s using the Algol 60 language on 
> an Elliott 803 mainframe computer, five track paper tape,
> teleprinters to prepare a program on white tape, results out on coloured 
> tape, colours changed when the rolls changed. If I remember
> correctly, error messages, either at compile time or at run time came out as 
> messages of a line number and an error number for compile
> time errors and a number for a run time error. One then looked up the number 
> in the manual or on the enlarged version of the numbers
> and the corresponding error messages that was mounted on the wall.
> 
> > While some simple applications may find that all their needs for 
> > communicating with their users are covered, most would wish they had
> > some other messages available.
> 
> Yes, but more messages could be added to the list much more often than emoji 
> are added to The Unicode Standard, maybe every month
> or every fortnight or every week if needed.
> 
> > To adopt your scheme, they would need to have a bifurcated approach, where 
> > some messages follow the standard, while others do not (cannot).
> 
> Not necessarily. A developer would just need to send in a request to Unicode 
> Inc. to add the needed extra sentences to the list and get a code number.
> 
> > It's pushing this kind of impractical scheme that gives standardizers a bad 
> > name.
> 
> It is not an impractical scheme.

I don’t fully disagree with Asmus, as I suggested to make available localizable 
(and effectively localized) libraries of message components, rather than 
of entire messages. The challenge as I see it is to get them translated to all 
locales. For this I'm hoping that the advantage of improving user support 
upstream instead of spending more time on support fora would be obvious.

By contrast I do disagree with the idea that industrial standards (as opposed 
to governmental procurement) are a safeguard against impractical schemes.
Devising impractical specifications on industrial procurement hasn't even been 
a privilege of the French NB (referring to the examples in my e-mail:
https://unicode.org/mail-arch/unicode-ml/y2018-m06/0082.html
), as demonstrated with the example of the hyphen conundrum where Unicode 
pushes the use of keyboard layouts featuring two distinct hyphens with 
same general category and same behavior, but different glyphs in some fonts 
whose designers didn’t think further than the original point of overly 
disambiguating hyphen semantics—while getting around similar traps with other 
punctuations.

And in this thread I wanted to demonstrate that by focusing on the wrong 
priorities, i.e. legacy character names instead of the practicability of 
on-going 
encoding and the accurateness of specified decompositions—so that in some 
instances cedilla was used instead of comma below, Michael pointed out—, 
ISO/IEC JTC1 SC2/WG2 failed to do its part and missed its mission—and thus 
didn’t inspire a desire of extensive cooperation (and damaged the reputation 
of the whole ISO/IEC).

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-11 Thread William_J_G Overington via Unicode
Steven R. Loomis wrote:

>Marcel,
> The idea is not necessarily without merit. However, CLDR does not usually 
> expand scope just because of a suggestion.
> I usually recommend creating a new project first - gathering data, looking at 
> and talking to projects to ascertain the usefulness of common messages.. one of 
> the barriers to adding new content for CLDR is not just the design, but 
> collecting initial data. When emoji or sub-territory names were added, many 
> languages were included before it was added to CLDR.

Well, maybe usually, but perhaps not this time? I opine that if it is going to 
be done it needs to be done under the umbrella of Unicode Inc. and have lots of 
people contribute a bit: that way businesses may well use it because, it being 
part of Unicode Inc., they will have assurance that no later claims for payment 
are possible. Not that any such claim would necessarily be made, but they need 
to know that. Also, having lots of people can help get the translations done, 
as there are a number of bilingual people who might like to pitch in. So please 
give the idea a sound chance of being implemented.

Asmus Freytag wrote:

> If you tried to standardize all error messages even in one language you would 
> never arrive at something that would be universally useful.

Well that is a big "If". One cannot standardize all pictures as emoji, but 
emoji still get encoded, some every year now.

I first learned to program back in the 1960s using the Algol 60 language on an 
Elliott 803 mainframe computer, five track paper tape, teleprinters to prepare 
a program on white tape, results out on coloured tape, colours changed when the 
rolls changed. If I remember correctly, error messages, either at compile time 
or at run time came out as messages of a line number and an error number for 
compile time errors and a number for a run time error. One then looked up the 
number in the manual or on the enlarged version of the numbers and the 
corresponding error messages that was mounted on the wall.

> While some simple applications may find that all their needs for 
> communicating with their users are covered, most would wish they had some 
> other messages available.

Yes, but more messages could be added to the list much more often than emoji 
are added to The Unicode Standard, maybe every month or every fortnight or 
every week if needed.

> To adopt your scheme, they would need to have a bifurcated approach, where 
> some messages follow the standard, while others do not (cannot).

Not necessarily. A developer would just need to send in a request to Unicode 
Inc. to add the needed extra sentences to the list and get a code number.

> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name.

It is not an impractical scheme.

It can be implemented straightforwardly using the star space system that I have 
devised.

http://www.users.globalnet.co.uk/~ngo/An_encoding_space_designed_for_application_in_encoding_localizable_sentences.pdf

http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_019.pdf

Start off with space for  error messages and number them from 4840001 
through to 484999 and allocate meanings as needed.

Then a side view of a 4-8-4 locomotive facing to the left could be a logo for 
the project.

Big 4-8-4 locomotives were built years ago. If people could do that then surely 
people can implement this project successfully now if they want to do so. 

For example, one error message could be as follows:

Data entry for the currency field must be either a whole positive number or a 
positive number to exactly two decimal places.

Another could be as follows:

Division by zero was attempted.

Yet another could be as follows:

The number of opening parentheses in the expression does not match the number 
of closing parentheses.

If some day more than  error messages are needed, these can be provided 
within star space as it is vast.

http://www.users.globalnet.co.uk/~ngo/a_completed_publication_about_localizable_sentences_research.pdf
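To make the proposal concrete, here is a minimal sketch of how an application might consume such a numbered sentence list. The code numbers, dictionary layout and French translations below are hypothetical, invented purely for illustration; they are not part of any standard or registry.

```python
# Hypothetical registry of numbered localizable sentences.
# Code numbers and translations are invented for this sketch.
MESSAGES = {
    4840001: {
        "en": "Division by zero was attempted.",
        "fr": "Une division par zéro a été tentée.",
    },
    4840002: {
        "en": "The number of opening parentheses in the expression "
              "does not match the number of closing parentheses.",
        "fr": "Le nombre de parenthèses ouvrantes de l'expression ne "
              "correspond pas au nombre de parenthèses fermantes.",
    },
}

def localize(code: int, lang: str) -> str:
    """Return the sentence for a code number, falling back to English."""
    sentences = MESSAGES.get(code)
    if sentences is None:
        return f"Unknown message code {code}"
    return sentences.get(lang, sentences["en"])

print(localize(4840001, "fr"))  # Une division par zéro a été tentée.
```

Under this scheme an application would ship only code numbers; the rendering side picks the sentence matching the user's locale, with English as the fallback.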

William Overington

Monday 11 June 2018


RE: The Unicode Standard and ISO

2018-06-11 Thread Jonathan Rosenne via Unicode
The scheme I have been using for years is a short message in the local language 
giving the main point of the error, together with a detailed message in English.

One has to see it to believe what happens to messages translated mechanically 
from English to bidi languages when data is embedded in the text.
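Both points can be sketched together: the two-language message scheme, plus wrapping embedded data in the Unicode bidirectional isolate characters U+2068 (FIRST STRONG ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE) from UAX #9, so that a Latin datum cannot reorder the surrounding right-to-left text. The Hebrew pattern and the file name are invented for illustration.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(data: str) -> str:
    """Bidi-isolate an embedded datum so it cannot reorder RTL text."""
    return FSI + data + PDI

def dual_message(local_pattern: str, english_detail: str, datum: str) -> str:
    """Short message in the local language plus the detailed English text."""
    local = local_pattern.format(isolate(datum))
    return f"{local}\n[{english_detail.format(datum)}]"

msg = dual_message("הקובץ {} לא נמצא.", "File {} was not found.", "C:/tmp/a.txt")
```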

Best Regards,

Jonathan Rosenne
-Original Message-
From: William_J_G Overington [mailto:wjgo_10...@btinternet.com] 
Sent: Monday, June 11, 2018 6:33 PM
To: verd...@wanadoo.fr; Jonathan Rosenne; asm...@ix.netcom.com; Steven R. 
Loomis; jameskass...@gmail.com; charupd...@orange.fr; peter...@microsoft.com; 
richard.wording...@ntlworld.com
Cc: unicode@unicode.org
Subject: Re: The Unicode Standard and ISO

Steven R. Loomis wrote:

>Marcel,
> The idea is not necessarily without merit. However, CLDR does not usually 
> expand scope just because of a suggestion.
 I usually recommend creating a new project first - gathering data, looking at 
and talking to projects to ascertain the usefulness of common messages.. one of 
the barriers to adding new content for CLDR is not just the design, but 
collecting initial data. When emoji or sub-territory names were added, many 
languages were included before it was added to CLDR.

Well, maybe usually, but perhaps not this time? I opine that if it is going to 
be done it needs to be done under the umbrella of Unicode Inc. and have lots of 
people contribute a bit: that way businesses may well use it because being part 
of Unicode Inc. they will have provenance over there being no possibility of 
later claims for payment. Not that any such claim would necessarily be made, 
but they need to know that. Also having lots of people can help get the 
translations done as there are a number of people who are bilingual who might 
like to pitch in. So, give the idea a sound chance of being implemented please.




RE: The Unicode Standard and ISO

2018-06-11 Thread Marcel Schneider via Unicode
> > From the outset, Unicode and the US national body tried repeatedly to 
> > engage with SC35 and SC35/WG5,
[…]
> As a reminder: The actual SC35 is in total disconnect from the same SC35 as 
> it was from the mid-eighties to mid-nineties and beyond.

Edit: ISO/IEC JTC1 SC35 was founded in 1999. (In the mentioned timespan, there 
was SC18/WG9.)

> > informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn’t 
> > appear to be interested
> [, or appeared to be interested in ]
> > a pet project and not in what is actually being used in industry.

It seems it isn’t even a pet project; today it’s nothing but a deplorable 
mismanagement mess. In my opinion, at 
some point the inadvertent French NB will apologize to the US National Body and 
to the Unicode Consortium.

As of now, I apologize for my part.

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sun, 10 Jun 2018 15:11:48 +, Peter Constable via Unicode wrote:
> 
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to cooperate, despite
> > repeated proposals for a merger of both instances.
> 
> First, ISO/IEC 15897 is built on a data-format specification, ISO/IEC TR 
> 14652, that never achieved the support
> needed to become an international standard, and has since been withdrawn. 
> (TRs cannot remain TRs forever.)
> Now, JTC1/SC35 began work four or five years ago to create data-format 
> specification for this, Approved Work Item 30112.
> From the outset, Unicode and the US national body tried repeatedly to engage 
> with SC35 and SC35/WG5,

The involvement in this decade of ISO/IEC JTC1 SC35 WG5 adds a scary level of 
complexity unrelated to the core issues. 
Andrew West already hinted that the stuff was moved from SC22 to SC35, but it 
took me some extra investigation to get the point.
As a reminder: The actual SC35 is in total disconnect from the same SC35 as it 
was from the mid-eighties to mid-nineties and beyond.

> informing them of UTS #35 (LDML) and CLDR, but were ignored. SC35 didn’t 
> appear to be interested
[, or appeared to be interested in ]
> a pet project and not in what is actually being used in industry.

Sorry, I had some difficulty understanding this, so I filled in what I think 
could have been elided.

> After several failed attempts, Unicode and the USNB gave up trying.

Thank you for bringing up this key information.

> 
> So, any suggestion that Unicode has failed to cooperate or is dropping the 
> ball with regard to locale data and ISO
> is simply uninformed.

That is exact. 

So I think this thread has now led to a main response, and all concerned people 
on this List are welcome 
to take note of these new facts showing that Unicode is totally innocent in 
ISO/IEC locale data issues.

If that doesn’t suffice to convince the missing people to cooperate in reviewing 
French data in CLDR, 
they may be pleased to know that I will keep doing my best to help.

Thank you everyone.

Best regards,

Marcel

> 
> 
> Peter
> 
> 
> From: Unicode  On Behalf Of Mark Davis ☕️ via Unicode
> Sent: Thursday, June 7, 2018 6:20 AM
> To: Marcel Schneider 
> Cc: UnicodeMailing 
> Subject: Re: The Unicode Standard and ISO
> 
> A few facts.
> 
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above 
statement is inaccurate.
> 
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to cooperate, despite
> repeated proposals for a merger of both instances.
> 
> I recall no serious proposals for that.
> 
> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
considerable costs of maintaining synchrony. Completely inadequate structure 
for modern system requirement, no particular industry support, and scant 
content: see Wikipedia for "The registry has not been updated since December 
2001".)
> 
> Mark
> 
[…]



RE: The Unicode Standard and ISO

2018-06-10 Thread Peter Constable via Unicode
> ... For another part it [sync with ISO/IEC 15897] failed because the 
> Consortium refused to cooperate, despite
repeated proposals for a merger of both instances.

First, ISO/IEC 15897 is built on a data-format specification, ISO/IEC TR 14652, 
that never achieved the support needed to become an international standard, and 
has since been withdrawn. (TRs cannot remain TRs forever.) Now, JTC1/SC35 began 
work four or five years ago to create data-format specification for this, 
Approved Work Item 30112. From the outset, Unicode and the US national body 
tried repeatedly to engage with SC35 and SC35/WG5, informing them of UTS #35 
(LDML) and CLDR, but were ignored. SC35 appeared to be interested in a pet 
project and not in what is actually being used in industry. After several 
failed attempts, Unicode and the USNB gave up trying.

So, any suggestion that Unicode has failed to cooperate or is dropping the 
ball with regard to locale data and ISO is simply uninformed.


Peter


From: Unicode  On Behalf Of Mark Davis ☕️ via 
Unicode
Sent: Thursday, June 7, 2018 6:20 AM
To: Marcel Schneider 
Cc: UnicodeMailing 
Subject: Re: The Unicode Standard and ISO

A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
speak to the synchronization level in more detail, but the above statement is 
inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the 
> Consortium refused to cooperate, despite
repeated proposals for a merger of both instances.

I recall no serious proposals for that.

(And in any event — very unlike the synchrony with 10646 and 14651 — ISO 15897 
brought no value to the table. Certainly nothing to outweigh the considerable 
costs of maintaining synchrony. Completely inadequate structure for modern 
system requirement, no particular industry support, and scant content: see 
Wikipedia for "The registry has not been updated since December 2001".)

Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode 
<unicode@unicode.org> wrote:
On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
>
> On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > Hello,
> >
> > There are several mentions of synchronization with related standards in
> > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > https://www.unicode.org/faq/unicode_iso.html.
> >  However, all such mentions
> > never mention anything other than ISO 10646.
>
> Because that is the standard for which there is an explicit understanding by 
> all involved
> relating to synchronization. There have been occasionally some challenging 
> differences
> in the process and procedures, but generally the synchronization is being 
> maintained,
> something that's helped by the fact that so many people are active in both 
> arenas.

Perhaps the cause-effect relationship is somewhat unclear. I think that many 
people being
active in both arenas is helped by the fact that there is a strong will to 
maintain synching.

If there were similar policies notably for ISO/IEC 14651 (collation) and 
ISO/IEC 15897
(locale data), ISO/IEC 10646 would be far from standing alone in the field of
Unicode-ISO/IEC cooperation.

>
> There are really no other standards where the same is true to the same extent.
> >
> > I was wondering which ISO standards other than ISO 10646 specify the
> > same things as the Unicode Standard, and of those, which ones are
> > actively kept in sync. This would be of importance for standardization
> > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > ISO standards is generally preferred in ISO standards.
> >
> One of the areas the Unicode Standard differs from ISO 10646 is that its 
> conception
> of a character's identity implicitly contains that character's properties

Re: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 21:21:40 -0700, Steven R. Loomis via Unicode wrote:
> 
> Marcel,
> The idea is not necessarily without merit. However, CLDR does not usually 
>expand scope just because of a suggestion.
>
> I usually recommend creating a new project first - gathering data, looking at 
> and talking to projects to ascertain the usefulness
> of common messages.. one of the barriers to adding new content for CLDR is 
> not just the design, but collecting initial data.
> When emoji or sub-territory names were added, many languages were included 
> before it was added to CLDR.

We know it took years to collect the subterritory names and make sure the list 
and translations are complete.

>
> Also note CLDR does have some typographical terms for use in UI, such as 
> 'bold' and 'italic'

I gather that these are intended for tooltips on basic formatting 
facilities. High-end software like Microsoft Office has many more and adds 
tooltips showing instructions for use, as part of a corporate strategy that aims at 
raising usability and overall quality. So I wonder whether there are 
limits to how far software vendors can cooperate with competitors to pool UI 
content?

This point and others would be cleared in the preliminary stage that you 
drafted above but that I don’t feel in a position to carry out, at least 
not now as I’m focusing on our national data in CLDR and on keyboard layouts 
and standards.

Anyhow, thank you for letting us know.

Best regards,

Marcel


> Regards,
> Steven
>
On Sat, Jun 9, 2018 at 3:41 PM Marcel Schneider via Unicode  wrote:
>
> On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> > 
> > On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > > Still a computer should be understandable off-line, so CLDR providing a 
> > > standard library of error messages could be 
> > > appreciated by the industry
> The kind of translations that CLDR accumulates, like day, and month names, 
> language and territory names, are a widely
> > applicable subset and one that is commonly required in machine generated or 
> > machine-assembled text (like displaying
> > the date, providing pick lists for configuration of locale settings, etc).
> > The universe of possible error messages is a completely different beast.
> > If you tried to standardize all error messages even in one language you 
> > would never arrive at something that would be
> > universally useful. While some simple applications may find that all their 
> > needs for communicating with their users are
> > covered, most would wish they had some other messages available.
>

>
…
> 
> > However, a high-quality terminology database recommends itself (and doesn't 
> > need any procurement standards).
> > Ultimately, it was its demonstrated usefulness that drove the adoption of 
> > CLDR.
> 
> This is why I’m so hopeful that CLDR will go much farther than date and time 
> and other locale settings, and emoji names and keywords.
>

>






Re: The Unicode Standard and ISO

2018-06-10 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
[…]
> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name. 
> 
> Especially if it is immediately tied to governmental procurement, forcing 
> people to adopt it (or live with it)
> whether it provides any actual benefit.

Or not. What I left untold is that governmental action does effectively work in 
both directions (examples following),
but governments don’t own that lien of ambivalence out of unbalanced 
discretion. When the French NB positioned 
against encoding Œœ in ISO/IEC 8859-1:1986, it wasn’t the government but a 
manufacturer who wanted to get 
around adding support for this letter in printers. It’s not fully clear to me 
why the same happened to Dutch IJij. 
Anyway as a result we had (and legacy doing the rest, still have) two digitally 
malfunctioning languages.
Thanks to the work of Hugh McGregor Ross, Peter Fenwick, Bernard Marti and Loek 
Zeckendorf (ISO/IEC 6937:1983), 
and from 1987 on thanks to the work of Joe Becker, Lee Collins and Mark Davis 
from Apple and Xerox, things started 
working fine, and do work the longer the better thanks to Mark Davis’ on-going 
commitment.

Industrial and governmental action both are ambivalent by nature simply because 
human action may happen to be 
short-sighted or far-sighted for a variety of reasons. When the French NB issued 
a QWERTY keyboard standard in 1973
and revised it in 1976, there were short-sighted industrial interests rather 
than governmental procurement. End-users 
never adopted it, there was no market, and it has recently been withdrawn. When 
governmental action, hard scientific 
work, human genius and an up-starting industrialization brought into existence 
a working keyboard for French that is 
usefully transposable to many other locales as well, it was enthusiastically 
adopted by the end-users and everybody 
urged the NB to standardize it. But the industry first asked for an 
international keyboard standard as a precondition… 
(which ended up being an excellent idea as well). The rest of the story may be 
spared as the conclusion is already clear.

There is one impractical scheme that bothers me, and that is that we have two 
hyphens because the ASCII hyphen was 
duplicated as U+2010. Now since font designers (e.g. Lucida Sans Unicode) took 
the hyphen conundrum seriously to 
avoid spoofing, or for whatever reason, we’re supposed to have keyboard layouts 
with two hyphens, both being Gc=Pd. 
That is where the related ISO WG2 could have been useful by positioning against 
U+2010, because disambiguating the minus sign U+2212 and keeping the 
hyphen-minus U+002D in use, like e.g. the period, would have been sufficient.
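The property situation described above can be checked directly against the Unicode Character Database, for instance with Python's standard `unicodedata` module:

```python
import unicodedata

# The two hyphens share General_Category Pd; the minus sign is Sm.
for ch in ("\u002D", "\u2010", "\u2212"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
# U+002D HYPHEN-MINUS: Pd
# U+2010 HYPHEN: Pd
# U+2212 MINUS SIGN: Sm
```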

On the other hand, it is entirely Unicode’s merit that we have two curly 
apostrophes, one that doesn’t break hashtags 
(U+02BC, Gc=Lm), and one that does (U+2019, Gc=Pf), as has been shared on this 
List (thanks to André Schappo). 
But despite a language being in a position to make a distinct use of each one 
of them, depending on whether the 
apostrophe helps denote a particular sound or marks an elision (and despite 
already having a physical keyboard and 
driver that would make distinct entry very easy and straightforward), 
submitting feedback didn’t help to raise concern 
so far. This is an example of how the industry and the governments united in the 
Unicode Consortium are saving end-users 
lots of trouble.
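The hashtag behaviour is easy to demonstrate with any word-character-based matcher. The `#\w+` pattern below is purely illustrative (real platforms tokenize hashtags differently), but it shows why the General_Category difference between the two apostrophes matters:

```python
import re
import unicodedata

# U+02BC is a letter (Lm); U+2019 is closing punctuation (Pf).
assert unicodedata.category("\u02BC") == "Lm"
assert unicodedata.category("\u2019") == "Pf"

hashtag = re.compile(r"#\w+")          # \w includes letters, so Lm qualifies
kept = hashtag.match("#O\u02BCahu")    # matches the whole tag
broken = hashtag.match("#O\u2019ahu")  # stops before the apostrophe
print(kept.group(), broken.group())    # #Oʼahu #O
```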

Thank you.

Marcel



Re: The Unicode Standard and ISO

2018-06-09 Thread Steven R. Loomis via Unicode
Marcel,
 The idea is not necessarily without merit. However, CLDR does not usually
expand scope just because of a suggestion.

 I usually recommend creating a new project first - gathering data, looking
at and talking to projects to ascertain the usefulness of common messages..
one of the barriers to adding new content for CLDR is not just the design,
but collecting initial data. When emoji or sub-territory names were added,
many languages were included before it was added to CLDR.

 Also note CLDR does have some typographical terms for use in UI, such as
'bold' and 'italic'

Regards,
Steven

On Sat, Jun 9, 2018 at 3:41 PM Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> >
> > On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > > Still a computer should be understandable off-line, so CLDR providing
> a standard library of error messages could be
> > > appreciated by the industry

> The kind of translations that CLDR accumulates, like day, and month
> names, language and territory names, are a widely
> > applicable subset and one that is commonly required in machine generated
> or machine-assembled text (like displaying
> > the date, providing pick lists for configuration of locale settings,
> etc).
> > The universe of possible error messages is a completely different beast.
> > If you tried to standardize all error messages even in one language you
> would never arrive at something that would be
> > universally useful. While some simple applications may find that all
> their needs for communicating with their users are
> > covered, most would wish they had some other messages available.
>

…
>
> > However, a high-quality terminology database recommends itself (and
> doesn't need any procurement standards).
> > Ultimately, it was its demonstrated usefulness that drove the adoption
> of CLDR.
>
> This is why I’m so hopeful that CLDR will go much farther than date and
> time and other locale settings, and emoji names and keywords.
>


Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 12:56:28 -0700, Asmus Freytag via Unicode wrote:
> 
> On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:
> > Still a computer should be understandable off-line, so CLDR providing a 
> > standard library of error messages could be 
> > appreciated by the industry.
>
> The kind of translations that CLDR accumulates, like day, and month names, 
> language and territory names, are a widely
> applicable subset and one that is commonly required in machine generated or 
> machine-assembled text (like displaying
> the date, providing pick lists for configuration of locale settings, etc).
> The universe of possible error messages is a completely different beast.
> If you tried to standardize all error messages even in one language you would 
> never arrive at something that would be
> universally useful. While some simple applications may find that all their 
> needs for communicating with their users are
> covered, most would wish they had some other messages available.

Indeed, error messages, although technical, are like the world’s books: a 
never-ending production of content. To account for 
this infinity, I was not proposing a closed set of messages to replace 
application libraries able to display message #123.
In fact I wrote first: “If to date, automatic [automated] translation of 
technical English still does not work, then I’d suggest 
that CLDR feature a complete message library allowing to compose any localized 
piece of information.”
Here the piece of information displayed by the application is like a Lego 
spacecraft, the CLDR messages like Lego bricks.
I haven’t played with Lego for a very long time, but as a boy I learned how it 
works. I even remember that when building 
a construct, it often happened that some bricks were “missing”. A Lego box is 
complete with respect to one or several models, but 
once, showing me the boxes on the shelves, my mom explained that they’re 
composed in a way that you’ll always lack 
something [when trying to build further]. That doesn’t prevent Lego from 
thriving, nor many people from enjoying it.
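The Lego analogy can be sketched as message "bricks" with named placeholders, composed at run time. The brick identifiers and the translations below are hypothetical, not actual CLDR data:

```python
# Hypothetical per-locale message bricks with named placeholders.
BRICKS = {
    "en": {"not-found": "The file {name} was not found.",
           "retry": "Check the path and try again."},
    "fr": {"not-found": "Le fichier {name} est introuvable.",
           "retry": "Vérifiez le chemin et réessayez."},
}

def compose(lang, brick_ids, **params):
    """Assemble an info box from bricks, falling back to English."""
    bricks = BRICKS.get(lang, BRICKS["en"])
    return " ".join(bricks[b].format(**params) for b in brick_ids)

print(compose("fr", ["not-found", "retry"], name="rapport.pdf"))
# Le fichier rapport.pdf est introuvable. Vérifiez le chemin et réessayez.
```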

> To adopt your scheme, they would need to have a bifurcated approach, where 
> some messages follow the standard,
> while others do not (cannot). At that point, why bother? Determining whether 
> some message can be rewritten to follow
> the standard adds another level of complexity while you'd need to have 
> translation resources for all the non-standard ones anyway.

When CLDR libraries allow generating 98 % well-translated info boxes, 
human translators may focus on the remaining 
2 %. If for any reason they cannot, the vendor will still get far fewer support 
requests than with the ill-translated messages.
 
> A middle ground is a shared terminology database that allows translators 
> working on different products to arrive at the same translation
> for the same things. Translators already know how to use such databases in 
> their work flow, and integrating a shared one with
> a product-specific one is much easier than trying to deal with a set of 
> random error messages.
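The shared-terminology middle ground quoted above is essentially an overlay lookup, and a few lines suffice to sketch it; the French entries are invented examples:

```python
from collections import ChainMap

# Shared cross-product terminology, overlaid by product-specific choices.
shared_fr = {"file": "fichier", "settings": "paramètres", "save": "enregistrer"}
product_fr = {"save": "sauvegarder"}  # this product's house style wins

terms_fr = ChainMap(product_fr, shared_fr)
print(terms_fr["save"], terms_fr["file"])  # sauvegarder fichier
```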

If the scheme you outline works well, where do the reported oddities come from? 
Obviously terminology is not everything; it’s like Lego bricks without studs: 
terms alone don’t interlock, and therefore the user cannot make sense of them. 
This is where CLDR’s hoped-for localizable message bricks would come into 
action, helping automated translation software compose understandable output 
using patterns. Google Translate is unable to do that, as shown in the English 
and French translations of this sentence found on a page of the Finnish NB:
https://www.sfs.fi/ajankohtaista/uutiset/nappaimistoon_tarjolla_lisayksia.4249.news

Finnish: Kielitoimiston ohjeen mukaan esimerkiksi vieraskielisissä nimissä on 
pyrittävä säilyttämään kaikki tarkkeet.
Google English: According to the Language Office, for example, in the name of a 
foreign language, it is necessary to maintain all the checkpoints.
Google French: Selon le Language Office, par exemple, au nom d'une langue 
étrangère, il est nécessaire de maintenir tous les points de contrôle.

> It's pushing this kind of impractical scheme that gives standardizers a bad 
> name. 
> 
> Especially if it is immediately tied to governmental procurement, forcing 
> people to adopt it (or live with it) whether it provides any actual benefit.

These statements make much sense to me…

> However, a high-quality terminology database recommends itself (and doesn't 
> need any procurement standards).
> Ultimately, it was its demonstrated usefulness that drove the adoption of 
> CLDR.

This is why I’m so hopeful that CLDR will go much farther than date and time 
and other locale settings, and emoji names and keywords.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-09 Thread Asmus Freytag via Unicode

  
  
On 6/9/2018 12:01 PM, Marcel Schneider via Unicode wrote:

> Still a computer should be understandable off-line, so CLDR providing a 
> standard library of error messages could be appreciated by the industry.

The kind of translations that CLDR accumulates, like day and month names, 
language and territory names, are a widely applicable subset and one that is 
commonly required in machine-generated or machine-assembled text (like 
displaying the date, providing pick lists for configuration of locale 
settings, etc).

The universe of possible error messages is a completely different beast. If 
you tried to standardize all error messages even in one language you would 
never arrive at something that would be universally useful. While some simple 
applications may find that all their needs for communicating with their users 
are covered, most would wish they had some other messages available.

To adopt your scheme, they would need to have a bifurcated approach, where 
some messages follow the standard, while others do not (cannot). At that 
point, why bother? Determining whether some message can be rewritten to follow 
the standard adds another level of complexity while you'd need to have 
translation resources for all the non-standard ones anyway.

A middle ground is a shared terminology database that allows translators 
working on different products to arrive at the same translation for the same 
things. Translators already know how to use such databases in their work flow, 
and integrating a shared one with a product-specific one is much easier than 
trying to deal with a set of random error messages.

It's pushing this kind of impractical scheme that gives standardizers a bad 
name. Especially if it is immediately tied to governmental procurement, 
forcing people to adopt it (or live with it) whether it provides any actual 
benefit.

However, a high-quality terminology database recommends itself (and doesn't 
need any procurement standards). Ultimately, it was its demonstrated 
usefulness that drove the adoption of CLDR.

A./



RE: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On the other hand, most end-users don’t appreciate getting “a screenful of 
all-in-English” when “something happened.”
If even big companies still haven’t succeeded in getting automated computer 
translation to work for error messages, then 
best practice could eventually be to provide an internet link with every 
message. Given that web pages are generally 
less sibylline than error messages, they may be easier to translate, and 
Philippe Verdy’s hint is therefore a working 
solution for localized software end-user support.
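The practice suggested here, a localized text carrying the untranslated English original for searchability plus a help link for details, might be sketched as follows; all strings and the URL are invented for illustration:

```python
def error_box(localized: str, english: str, help_url: str) -> str:
    """Render an off-line-readable error box with an on-line pointer."""
    return (f"{localized}\n"
            f"[English: {english}]\n"
            f"More information: {help_url}")

box = error_box("Le document n'a pas pu être enregistré.",
                "The document could not be saved.",
                "https://example.com/help/err-1234")
print(box)
```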

Still a computer should be understandable off-line, so CLDR providing a 
standard library of error messages could be 
appreciated by the industry.

Best regards,

Marcel 

On Sat, 9 Jun 2018 18:14:17 +, Jonathan Rosenne via Unicode wrote:
> 
> Translated error messages are a horror story. Often I have to play around 
> with my locale settings to avoid them.
> Using computer translation on programming error messages is no way near to 
> being useful.
> 
> Best Regards,
> 
> Jonathan Rosenne
> 
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe 
> Verdy via Unicode
> Sent: Saturday, June 09, 2018 7:49 PM
> To: Marcel Schneider
> Cc: UnicodeMailingList
> Subject: Re: The Unicode Standard and ISO
 

 

2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> > 
> > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> > 
> > > > Where there is opportunity for productive sync and merging with is
> > > > glibc. We have had some discussions, but more needs to be done-
> > > > especially a lot of tooling work. Currently many bug reports are
> > > > duplicated between glibc and cldr, a sort of manual
> > > > synchronization. Help wanted here.  
> > > 
> > > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > > help.
> > 
> > I wonder how much of that comes under the sad category of "better not
> > translated". If an English speaker has to resort to search engines to
> > understand, let alone fix, a reported problem, it may be better for a
> > non-English speaker to search for the error message in English, and then
> > with luck he may find a solution he can understand.
> 
> Then adding a "Display in English" button in the message box is best practice.
> Still I’ve never encountered any yet, and I guess this is because such a 
> facility 
> would be understood as an admission that up to now, i18n is partly a failure.

 


- Navigate any page on the web in a language other than yours, with a Google 
Translate plugin enabled in your browser; you'll have the choice of seeing 
the automatic translation or the original.


 


- Many websites that have pages proposed in multiple languages offer such 
buttons to select the language you want to see (and not necessarily falling 
back to English, because the original may as well be in another language and 
English is an approximate translation, notably for sites in Asia, Africa and 
South America).


 


- Even the official websites of the European Union (or EEA) offers such choice 
(but at least the available translations are correctly reviewed for European 
languages; not all pages are translated in all official languages of member 
countries, but this is the case for most pages intended to be read by the 
general public, while pages about ongoing works, or technical reports for 
specialists, or recent legal decisions may not be translated except in a few 
"working languages", generally English, German, and French, sometimes Italian, 
the 4 languages spoken officially in multiple countries in the EEA 
including at least one in the European Union).


 


So it's not a "failure" but a feature to be able to select the language, and to 
know when a proposed translation is fully or partly automated.








RE: The Unicode Standard and ISO

2018-06-09 Thread Jonathan Rosenne via Unicode
Translated error messages are a horror story. Often I have to play around with 
my locale settings to avoid them. Using computer translation on programming 
error messages is nowhere near being useful.

Best Regards,

Jonathan Rosenne

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Saturday, June 09, 2018 7:49 PM
To: Marcel Schneider
Cc: UnicodeMailingList
Subject: Re: The Unicode Standard and ISO



2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
>
> On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
>
> > > Where there is opportunity for productive sync and merging with is
> > > glibc. We have had some discussions, but more needs to be done-
> > > especially a lot of tooling work. Currently many bug reports are
> > > duplicated between glibc and cldr, a sort of manual
> > > synchronization. Help wanted here.
> >
> > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > help.
>
> I wonder how much of that comes under the sad category of "better not
> translated". If an English speaker has to resort to search engines to
> understand, let alone fix, a reported problem, it may be better for a
> non-English speaker to search for the error message in English, and then
> with luck he may find a solution he can understand.

Then adding a "Display in English" button in the message box is best practice.
Still I’ve never encountered any yet, and I guess this is because such a 
facility
would be understood as an admission that up to now, i18n is partly a failure.

- Navigate any page on the web in a language other than yours, with a Google 
Translate plugin enabled in your browser; you'll have the choice of seeing the 
automatic translation or the original.

- Many websites that have pages proposed in multiple languages offer such 
buttons to select the language you want to see (and not necessarily falling back 
to English, because the original may as well be in another language and 
English is an approximate translation, notably for sites in Asia, Africa and 
South America).

- Even the official websites of the European Union (or EEA) offers such choice 
(but at least the available translations are correctly reviewed for European 
languages; not all pages are translated in all official languages of member 
countries, but this is the case for most pages intended to be read by the 
general public, while pages about ongoing works, or technical reports for 
specialists, or recent legal decisions may not be translated except in a few 
"working languages", generally English, German, and French, sometimes Italian, 
the 4 languages spoken officially in multiple countries in the EEA including at 
least one in the European Union).

So it's not a "failure" but a feature to be able to select the language, and to 
know when a proposed translation is fully or partly automated.


Re: The Unicode Standard and ISO

2018-06-09 Thread Philippe Verdy via Unicode
2018-06-09 17:22 GMT+02:00 Marcel Schneider via Unicode :

> On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> >
> > On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> >
> > > > Where there is opportunity for productive sync and merging with is
> > > > glibc. We have had some discussions, but more needs to be done-
> > > > especially a lot of tooling work. Currently many bug reports are
> > > > duplicated between glibc and cldr, a sort of manual
> > > > synchronization. Help wanted here.
> > >
> > > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > > help.
> >
> > I wonder how much of that comes under the sad category of "better not
> > translated". If an English speaker has to resort to search engines to
> > understand, let alone fix, a reported problem, it may be better for a
> > non-English speaker to search for the error message in English, and then
> > with luck he may find a solution he can understand.
>
> Then adding a "Display in English" button in the message box is best
> practice.
> Still I’ve never encountered any yet, and I guess this is because such a
> facility
> would be understood as an admission that up to now, i18n is partly a
> failure.


- Navigate any page on the web in a language other than yours, with a
Google Translate plugin enabled in your browser; you'll have the choice of
seeing the automatic translation or the original.

- Many websites that have pages proposed in multiple languages offer such
buttons to select the language you want to see (and not necessarily falling
back to English, because the original may as well be in another language
and English is an approximate translation, notably for sites in Asia,
Africa and South America).

- Even the official websites of the European Union (or EEA) offers such
choice (but at least the available translations are correctly reviewed for
European languages; not all pages are translated in all official languages
of member countries, but this is the case for most pages intended to be
read by the general public, while pages about ongoing works, or technical
reports for specialists, or recent legal decisions may not be translated
except in a few "working languages", generally English, German, and French,
sometimes Italian, the 4 languages spoken officially in multiple countries
in the EEA including at least one in the European Union).

So it's not a "failure" but a feature to be able to select the language,
and to know when a proposed translation is fully or partly automated.


Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Sat, 9 Jun 2018 09:47:01 +0100, Richard Wordingham via Unicode wrote:
> 
> On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
> 
> > > Where there is opportunity for productive sync and merging with is
> > > glibc. We have had some discussions, but more needs to be done-
> > > especially a lot of tooling work. Currently many bug reports are
> > > duplicated between glibc and cldr, a sort of manual
> > > synchronization. Help wanted here.  
> > 
> > Noted. For my part, sadly for C libraries I’m unlikely to be of any
> > help.
> 
> I wonder how much of that comes under the sad category of "better not
> translated". If an English speaker has to resort to search engines to
> understand, let alone fix, a reported problem, it may be better for a
> non-English speaker to search for the error message in English, and then
> with luck he may find a solution he can understand.

Then adding a "Display in English" button in the message box is best practice.
Still I’ve never encountered any yet, and I guess this is because such a 
facility 
would be understood as an admission that up to now, i18n is partly a failure.
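A "Display in English" toggle only requires that the message catalog always retain the original English form alongside any translation. The following is a minimal sketch of that idea; the catalog contents and message IDs are hypothetical, not any real product's strings.

```python
# Sketch of an error-message catalog that always keeps the English
# original, making a "Display in English" toggle trivial to offer.
# Message IDs, texts, and translations are hypothetical.
CATALOG = {
    "E_DISK_FULL": {
        "en": "The disk is full.",
        "fr": "Le disque est plein.",
    },
    "E_NET_DOWN": {
        "en": "The network is unreachable.",
        # no French entry yet -> falls back to English
    },
}

def message(msg_id: str, locale: str, show_original: bool = False) -> str:
    """Return the localized message, or the English original on request
    or when no translation exists for the locale."""
    forms = CATALOG[msg_id]
    if show_original:
        return forms["en"]
    return forms.get(locale, forms["en"])

print(message("E_DISK_FULL", "fr"))                      # Le disque est plein.
print(message("E_DISK_FULL", "fr", show_original=True))  # The disk is full.
print(message("E_NET_DOWN", "fr"))                       # The network is unreachable.
```

Keying on stable message IDs rather than on the English text also means a user can search the web for the untranslated original, which addresses the concern raised earlier in this thread.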

> In a related vein,
> one hears reports of people using English as the interface language,
> because they can't understand the messages allegedly in their native
> language.

If, to date, automatic translation of technical English still does not work, 
then I’d suggest 
that CLDR feature a complete message library allowing one to compose any 
localized piece 
of information. But such an attempt requires that all available human resources 
really 
focus on the project, instead of being diverted by interpersonal discordances. 
Sulking 
people around a project are an indicator of poor project management that brands 
dissenters 
as enemies, out of an inability to behave diplomatically for lack of social 
skills.
At least that’s what they’d teach you in any management school.

The way Unicode behaves against William Overington is in my opinion a striking 
example 
of mismanagement. In one dimension I can see, the "localizable sentences" that 
William invented and that he actively promotes do fit exactly into the scheme 
of localizable 
information elements suggested in the preceding paragraph. I strongly recommend 
that 
instead of publicly blacklisting the author in the mailbox of the president and 
directing 
the List moderation to prohibit the topic as out of scope of Unicode, an 
extensible and flexible 
framework be designed in urgency under the Unicode‐CLDR umbrella to put an end 
to the 
pseudo‐localization that Richard pointed above.

OK I’m lacking diplomatic skills too, and this e‐mail is harsh, but I see it as 
a true echo.
And I apologize for my last reply to William Overington, if I need to.
http://www.unicode.org/mail-arch/unicode-ml/y2018-m03/0118.html

Besides that, I’d suggest also adding a CLDR library of character name elements 
allowing 
one to compose every existing Unicode character name in all supported locales, for 
use in 
system character pickers and special character dialogs. This library should 
then be updated 
at each major release of the UCS. Hopefully this library is then flexible 
enough to avoid 
any Standardese, be it in English, in French, or in any language aping English 
Standardese.
E.g. when the ISO/IEC 10646 mirror of Unicode was published in an official 
French version, 
the official translators felt partly committed to ape English Standardese, of 
which we know 
that it isn’t due mainly to Unicode, but to the then‐head of ISO/IEC JTC1 SC2 
WG2. Not to 
warm up that old grudge, just to show how on‐topic that is. Be it Standardese 
or pseudo‐
localization, the effect is always to worsen UX by missing the point.
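The element-based name composition suggested above can be sketched with the standard `unicodedata` module. The French element translations below are purely hypothetical placeholders; a real CLDR library would need the full element inventory plus word-order rules (actual French names read "LETTRE MINUSCULE LATINE …", not word-for-word English order).

```python
import unicodedata

# Hypothetical French renderings of Unicode character-name elements;
# this toy table covers only one character's name.
FR_ELEMENTS = {
    "LATIN": "LATINE",
    "SMALL": "MINUSCULE",
    "CAPITAL": "MAJUSCULE",
    "LETTER": "LETTRE",
    "WITH": "AVEC",
    "ACUTE": "ACCENT AIGU",
}

def localized_name(ch: str) -> str:
    """Compose a localized name element by element (word order ignored,
    which is an oversimplification for real French)."""
    return " ".join(FR_ELEMENTS.get(w, w) for w in unicodedata.name(ch).split())

print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(localized_name("\u00e9"))    # LATINE MINUSCULE LETTRE E AVEC ACCENT AIGU
```

The benefit of the element approach is exactly the one argued above: a new UCS major release only adds new elements and composition rules, not tens of thousands of independently translated names.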

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-09 Thread Philippe Verdy via Unicode
I just see the WG2 as a subcommittee where governments may just check their
practices and make minimum recommendations. Most governments are in fact
very late to adopt the industry standards that evolve fast, and they just
want to reduce the frequency of necessary changes, merely ratifying what
seems to be stable enough and gives them a long enough period to plan the
transitions. So ISO 10646 has had in fact very few updates compared to
Unicode (even if these Unicode changes were "synchronized", most of them
remained for long within optional amendments that are then synchronized in
ISO 10646 long after the industry has started working on updating their
code for Unicode and made checks to ensure that it is stable enough to be
finally included in ISO 10646 later as the new minimal platform that
governments can reasonably ask to be provided by their providers in the
industry at reasonable (or no) additional cost.
So I now see ISO/IEC 10646 only as a small subset of the Unicode standard. The
WG2 technical committee is just there to finally approve what can be endorsed
as a standard whose usage is made mandatory in governments, while TUS
itself is still (and will remain) just optional (not a requirement). It
takes months or years for new TUS features to become available on all
platforms that governments use. WG2 probably does not focus really on
technical merits, but just on evaluating the implementation and deployment
costs, and that's where the WG2 members decide what is reasonable for them
to adopt (let's also not forget that ISO standards are mapped to national
standards that reference them normatively, and these national standards (or
European standards in the EEA) are legal requirements: governments then no
longer need to specify each time which requirements they want, they're just
saying that the national standards within a certain class are required for
all product/service offers, and failure to implement these standards will
require those providers to fix their products at no additional cost, and
independently of the contracted or subscribed period of support).


2018-06-08 23:28 GMT+02:00 Marcel Schneider via Unicode :

> On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> >
> […]
> > There's no value added in creating "mirrors" of something that is
> successfully being developed and maintained under a different umbrella.
>
> Wouldn’t the same be true for ISO/IEC 10646? It has no added value
> either, and WG2 meetings could be merged with UTC meetings.
> Unicode maintains the entire chain, from the roadmap to the production
> tool (that the Consortium ordered without paying a full license).
>
> But the case is about part of the people who are eager to maintain an
> alternate forum, whereas the industry (i.e. the main users of the data)
> are interested in fast‐tracking character batches, and thus tend to
> shortcut the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying
> the same logic as to ISO/IEC 15897, WG2 would be eliminated. The reason
> why it was not, is that Unicode was weaker and needed support
> from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being
> useless in practice, as it pursued an unrealistic encoding scheme.
> To overcome this, somebody in ISO started actively campaigning for the
> Unicode encoding model, encountering fierce resistance from fellow
> ISO people until he succeeded in teaching them real‐life computing. He had
> already invented and standardized the sorting method later used
> to create UCA and ISO/IEC 14651. I don’t believe that today everybody
> forgot about him.
>
> Marcel
>
>


Re: The Unicode Standard and ISO

2018-06-09 Thread Richard Wordingham via Unicode
On Sat, 9 Jun 2018 08:23:33 +0200 (CEST)
Marcel Schneider via Unicode  wrote:

> > Where there is opportunity for productive sync and merging with is
> > glibc. We have had some discussions, but more needs to be done-
> > especially a lot of tooling work. Currently many bug reports are
> > duplicated between glibc and cldr, a sort of manual
> > synchronization. Help wanted here.   
> 
> Noted. For my part, sadly for C libraries I’m unlikely to be of any
> help.

I wonder how much of that comes under the sad category of "better not
translated".  If an English speaker has to resort to search engines to
understand, let alone fix, a reported problem, it may be better for a
non-English speaker to search for the error message in English, and then
with luck he may find a solution he can understand.  In a related vein,
one hears reports of people using English as the interface language,
because they can't understand the messages allegedly in their native
language.

Richard.



Re: The Unicode Standard and ISO

2018-06-09 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 09:20:09 -0700, Steven R. Loomis via Unicode wrote:
[…]
> But, it sounds like the CLDR process was successful in this case. Thank you 
> for contributing.
 
You are welcome, but thanks are due to the actual corporate contributors.

[…]
> Actually, I think the particular data item you found is relatively new. The 
> first values entered
> for it in any language were May 18th of this year.  Were there votes for 
> "keycap" earlier?

The "keycap" category is found as soon as in v30 (released 2016-10-05).

> Rather than a tracer finding evidence of neglect, you are at the forefront of 
> progressing the translated data for French. Congratulations!

The neglect is on my part as I neglected to check the data history. 
Please note that I did not make accusations of neglect. Again: the historic 
Code Charts translators, partly still active, shun CLDR 
because Unicode is perceived as shunning ISO/IEC 15897, so that only minimal staff is 
actively translating CLDR for the French locale and can 
legitimately feel forsaken. I even made detailed suppositions as to how it 
could happen that "keycap" remained untranslated.
 
[…] [Unanswered questions (please refer to my other e‐mails in this thread)]

> The registry for ISO/IEC 15897 has neither data for French, nor structure 
> that would translate the term "Characters | Category | Label | keycap". 
> So there would be nothing to merge with there.

Correct. The only data for French is an ISO/IEC 646 charset:
http://std.dkuug.dk/cultreg/registrations/number/156
As far as I can see there are data available to merge for Danish, Faroese, 
Finnish, Greenlandic, Norwegian, and Swedish.

> So, historically, CLDR began not a part of Unicode, but as part of Li18nx 
> under the Free Standards Group. See the bottom of the page 
> http://cldr.unicode.org/index/acknowledgments
> "The founding members of the workgroup were IBM, Sun and OpenOffice.org". 
> What we were trying to do was to provide internationalized content for Linux, 
> and also, to resolve the then-disparity between locale data
> across platforms. Locale data was very divergent between platforms - spelling 
> and word choice changes, etc.  Comparisons were done
> and a Common locale data repository  (with its attendant XML formats) 
> emerged. That's the C in CLDR. Seed data came from IBM’s ICIR
> which dates many decades before 15897 (example 
> http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/
> - 4th edition published in 1994.) 100 locales we contributed to glibc as well.

Thank you for the account and resources. The Linux Internationalization 
Initiative appears to have issued a last release on August 23, 2000:
https://www.redhat.com/en/about/press-releases/83
the year before ISO/IEC 15897 was lastly updated:
http://std.dkuug.dk/cultreg/registrations/chreg.htm

> Where there is opportunity for productive sync and merging with is glibc. We 
> have had some discussions, but more needs to be
> done- especially a lot of tooling work. Currently many bug reports are 
> duplicated between glibc and cldr, a sort of manual synchronization.
> Help wanted here. 

Noted. For my part, sadly for C libraries I’m unlikely to be of any help.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Richard Wordingham via Unicode
On Fri, 8 Jun 2018 20:45:26 +0200
Philippe Verdy via Unicode  wrote:

> 2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  

> The way tailoring is designed in CLDR, using only data consumed by a
> generic algorithm and not a custom algorithm, is not the only way to
> collate Lao. You can perfectly well add new custom algorithm primitives
> that will use new collation data rules that can be inserted as
> "hooks" in UCA (which provides several points at which this is
> possible, but UCA just makes these hooks act as "no-ops").

The ideal is to have a common library rather than add specific routines
to support specific languages.  Now, this can be done in a common
library; ICU break iterators have dedicated routines for CJK and for
Siamese.  I wonder if this could be done for Lao and possibly Tai
Lue.  I've a vague recollection that UCA collation for Tai Lue in the
New Tai Lue script only needs thousands of contractions, so it may work
well enough in the main CLDR collation algorithm.  Martin Hosken
provided the numbers, probably on the Unicore list, when New Tai Lue
formally switched from phonetic to visual order.  Taking the definition
of logical order literally, the change legitimised the logical order of
New Tai Lue. 

> You can be much faster if you create a specific library for Lao, that
> would still be able to process the basic collation rules and then
> make more advanced inferences based on larger cluster boundaries than
> just those considered in the standard basic UCA, so it is perfectly
> possible to extend it to cover more complex Lao syllables and various
> specific quirks (such as hyphenation in the middle of clusters, as
> seen in some Indic scripts using left matras).

How is this hyphenation done?  The answer probably belongs in the
thread entitled 'Hyphenation Markup', unless it's restricted to the
visual order scripts.  If it's occurring in the visual order scripts,
we may need to add contractions for ; U+00AD breaks contractions, and, indeed, may be used for
exactly that purpose, as it is generally easier to type than CGJ.
While I've seen line-breaking after a left matra in Thai, I've never
*seen* a hyphen after a left matra.

Richard.


Re: The Unicode Standard and ISO

2018-06-08 Thread Richard Wordingham via Unicode
On Fri, 8 Jun 2018 14:14:51 -0700
"Steven R. Loomis via Unicode"  wrote:

> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR. Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation.  

>  CLDR is not a collation implementation, it is a data repository with
> associated specification. It was never required to 'support' DUCET.
> The contents of CLDR have no bearing on whether implementations
> support DUCET.

DUCET used to be the root collation of CLDR.

> CLDR ≠ ICU.

DUCET is a standard collation.  Language-specific collations are
stored in CLDR, so why not an international standard?  Does ICU store
collations not defined in CLDR?  The formal snag is that the collations
have to be LDML tailorings of the CLDR root collation, which is a
formal problem for U+FDD0.  I would expect you to argue that it is more
useful for U+FDD0 to have the special behaviour defined in CLDR, and
restrict conformance with DUCET to characters other than non-characters.
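The multi-level weighting that DUCET and the CLDR root collation share can be illustrated with a toy sort key: each character maps to (primary, secondary, tertiary) weights, and keys compare level by level, so accents matter less than base letters and case less than accents. All weights below are invented for illustration, not real allkeys.txt values.

```python
# Toy DUCET-style multi-level sort keys; weights are invented.
WEIGHTS = {
    "a": (0x1C47, 0x20, 0x02),
    "A": (0x1C47, 0x20, 0x08),       # differs from "a" only at level 3 (case)
    "\u00e1": (0x1C47, 0x24, 0x02),  # "á": differs at level 2 (toy, precomposed)
    "b": (0x1C60, 0x20, 0x02),
}

def sort_key(s: str) -> tuple:
    """Build a UCA-style sort key: all primaries, then a 0 separator,
    all secondaries, another 0, then all tertiaries."""
    levels = ([], [], [])
    for ch in s:
        for level, weight in zip(levels, WEIGHTS[ch]):
            level.append(weight)
    return tuple(levels[0]) + (0,) + tuple(levels[1]) + (0,) + tuple(levels[2])

print(sorted(["b", "\u00e1", "A", "a"], key=sort_key))  # ['a', 'A', 'á', 'b']
```

Because primaries for a whole string are compared before any secondary, a base-letter difference anywhere outweighs an accent difference earlier in the string — the behavior a tailoring must preserve when it promotes a locally secondary weight to primary, as discussed above.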

> On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  

> > On Fri, 8 Jun 2018 13:40:21 +0200
> > Mark Davis ☕️  wrote:

> > > > The UCA contains features essential for respecting canonical
> > > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > > apparently even going to the extreme of implicitly declaring
> > > > that Vietnamese is not a human language.  

> > > A bit over the top, eh?​  
> >
> > Then remove the "no known language" from the bug list

> What does this refer to?

http://userguide.icu-project.org/collation/customization

Under the heading "Known Limitations" it says:

"The following are known limitations of the ICU collation
implementation. These are theoretical limitations, however, since there
are no known languages for which these limitations are an issue.
However, for completeness they should be fixed in a future version
after 1.8.1. The examples given are designed for simplicity in testing,
and do not match any real languages."

Then, the particular problem is listed under the heading "Contractions
Spanning Normalization".  The assumption is that FCD strings do not
need to be decomposed.  This comes unstuck when what is locally a
secondary weight due to a diacritic on a vowel has to be promoted to a
primary weight to support syllable by syllable collation in a system
not set up for such a tiered comparison.

> > …ICU isn't
> > fast enough to load a collation from customisation - it takes
> > hours!  

> > ICU is, alas, ridiculously slow

> I'm also curious what this refers to, perhaps it should be a separate
> ICU bug?

There may be reproducibility issues.  A proper bug report will take some
work.  There's also the argument that nearly 200,000 contractions is
excessive.  I had to disable certain checks that were treating "should
not" as a prohibition - working round them either exceeded ICU's
capacity because of the necessary increase in the number of
contractions, or was incompatible with the design of the collation.

The weight customisation creates 45 new weights, with lines like

"&\u0EA1 = \ufdd2\u0e96 < \ufdd2\u0e97 # MO for THO_H & THO_L"

I use strings like \ufdd2\u0e96 to emulate ISO/IEC 14651
(primary) weights.  I carefully reuse default Lao weights so as to keep
collating elements' list of collation elements short.

There are a total of 187174 non-comment lines, most being simple
contractions like

"&\u0ec8\ufdd2\u0e96\ufdd2AAW\ufdd3\u0e94 = \u0ec8\u0e96\u0ead\u0e94 #
1+K+AW+N  N is mandatory!"

and prefix contractions like

"&\ufdd2AAW\ufdd3\u0e81\u0ec9 = \u0e96\u0ec9 | ອ\u0e81 # K+1|ອ+N
 N is mandatory".

I strip the comments off as I convert the collation definition to
UTF-16; if I remember correctly I also have to convert escape sequences
to characters.  That processing is a negligible part of the time.

By comparison, the loading of 30,000 lines from allkeys.txt is barely
discernible.

The generation of the loading of the collation was reasonably fast when
I generated DUCET-style collation weights using bash.

For my purposes, I would get better performance if ICU's collation just
blindly converted strings to NFD, but then all I am using it for is to
compare collation rules against a dictionary.  I suspect it's just that
I lose out massively as a result of ICU's tradeoffs.

Richard.



Re: The Unicode Standard and ISO

2018-06-08 Thread Asmus Freytag via Unicode

  
  
On 6/8/2018 2:28 PM, Marcel Schneider via Unicode wrote:


  On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:

  


  
  […]

  
There's no value added in creating "mirrors" of something that is successfully being developed and maintained under a different umbrella.

  
  
Wouldn’t the same be true for ISO/IEC 10646? It has no added value either, and WG2 meetings could be merged with UTC meetings.
Unicode maintains the entire chain, from the roadmap to the production tool (that the Consortium ordered without paying a full license).


Without going into a lot of historical detail, the situations are
not comparable; I don't think I agree to the way you summarize
things here, but unfortunately I have not the time to elaborate
further. It suffices to note that 10646 was and is a special case.

Not every attempt at standardization has to happen at ISO. Even on a
treaty level there have always been other organizations, for example
ITU.

Almost the worst thing you can do is duplicating an existing and
well-established effort (by which I mean not a paper effort, but one
that is being implemented widely). Doing so just adds needless
complexity, but it will always satisfy people who are engaging in
the kind of turf-war that makes them feel important.

A./





  

But the case is about part of the people who are eager to maintain an alternate forum, whereas the industry (i.e. the main users of the data) 
are interested in fast‐tracking character batches, and thus tend to shortcut the ISO/IEC JTC1 SC2 WG2. This is proof enough that applying 
the same logic as to ISO/IEC 15897, WG2 would be eliminated. The reason why it was not, is that Unicode was weaker and needed support 
from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being useless in practice, as it pursued an unrealistic encoding scheme.
To overcome this, somebody in ISO started actively campaigning for the Unicode encoding model, encountering fierce resistance from fellow 
ISO people until he succeeded in teaching them real‐life computing. He had already invented and standardized the sorting method later used 
to create UCA and ISO/IEC 14651. I don’t believe that today everybody forgot about him.

Marcel






  



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 16:54:20 -0400, Tom Gewecke via Unicode wrote:
> 
> > On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode  wrote:
> > 
> > People relevant to projects for French locale do trace the borderline of 
> > applicability wider 
> > than do those people who are closerly tied to Unicode‐related projects.
> 
> Could you give a concrete example or two of what these people mean by “wider 
> borderline of applicability”
> that might generate their ethical dilemma?
> 

Drawing the borderline up to which ISO/IEC should be among the involved 
parties, as I put it, is about the Unicode policy 
as to how ISO/IEC JTC1 SC2 WG2 is involved in the process, how it appears in 
public (FAQs, Mailing List responding practice, 
and so on), and how people in that WG2 feel with respect to Unicode. That may 
be different depending on the standard concerned 
(ISO/IEC 10646, ISO/IEC 14651), so that the former is put in the first place as 
vital to Unicode, while the latter is almost entirely 
hidden (except in appendix B of UTS #10).

Then when it comes to locale data, Unicode people see the borderline below, 
while ISO people tend to see it above. This is why 
Unicode people do not want the twin‐standards‐bodies principle applied to 
locale data, and are ignoring or declining any attempt 
to equalize the situations, arguing that ISO/IEC 15897 is useless. As I’ve pointed out 
in my previous e‐mail responding to Asmus Freytag, 
ISO/IEC 10646 was about as useless until Unicode came on it and merged itself 
with that UCS embryo (not to say that miscarriage 
on the way). The only thing WG2 could insist upon were names and huge bunches 
of precomposed or preformatted characters that 
Unicode was designed to support in plain text by other means. The essential 
part was Unicode’s, and without Unicode we wouldn’t 
have any usable UCS. ISO/IEC 15897 appears to be in a similar position: not 
very useful, not very performant, not very complete. 
But an ISO/IEC standard. Logically, Unicode should feel committed to merge with 
it the same way it did with the other standard, 
maintaining the data, and publishing periodical abstracts under ISO coverage. 
There is no problem in publishing a framework standard 
under the ISO/IEC umbrella, associated with a regular up‐to‐date snapshot of 
the data.

That is what I mean when I say that Unicode arbitrarily draws borderlines of 
its own, regardless of how people at ISO feel about them.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 13:33:20 -0700, Asmus Freytag via Unicode wrote:
> 
[…]
> There's no value added in creating "mirrors" of something that is 
> successfully being developed and maintained under a different umbrella.

Wouldn’t the same be true for ISO/IEC 10646? It adds no value either, and 
WG2 meetings could be merged with UTC meetings.
Unicode maintains the entire chain, from the roadmap to the production tool 
(which the Consortium ordered without paying a full license).

But the case is that some people are eager to maintain an alternate 
forum, whereas the industry (i.e. the main users of the data) 
is interested in fast‐tracking character batches, and thus tends to shortcut 
ISO/IEC JTC1 SC2 WG2. This is proof enough that, were the same 
logic applied as to ISO/IEC 15897, WG2 would be eliminated. The reason why 
it was not is that Unicode was weaker and needed support 
from ISO/IEC to gain enough traction, despite the then‐ISO/IEC 10646 being 
useless in practice, as it pursued an unrealistic encoding scheme.
To overcome this, somebody in ISO started actively campaigning for the Unicode 
encoding model, encountering fierce resistance from fellow 
ISO people until he succeeded in teaching them real‐life computing. He had 
already invented and standardized the sorting method later used 
to create UCA and ISO/IEC 14651. I don’t believe that everybody has forgotten 
about him today.

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Steven R. Loomis via Unicode
Richard,

> But the consortium has formally dropped the commitment to DUCET in CLDR.
> Even when restricted to strings of assigned characters, the
> CLDR and ICU no longer make the effort to support the DUCET
> collation.

 CLDR is not a collation implementation, it is a data repository with
associated specification. It was never required to 'support' DUCET. The
contents of CLDR have no bearing on whether implementations support DUCET.

CLDR ≠ ICU.

On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️  wrote:
>
> > > The UCA contains features essential for respecting canonical
> > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?​
>
> Then remove the "no known language" from the bug list
>

What does this refer to?


>
> …ICU isn't
> fast enough to load a collation from customisation - it takes hours!

…

> ICU is, alas, ridiculously slow
>

I'm also curious what this refers to, perhaps it should be a separate ICU
bug?


Re: The Unicode Standard and ISO

2018-06-08 Thread Tom Gewecke via Unicode


> On Jun 8, 2018, at 9:52 AM, Marcel Schneider via Unicode 
>  wrote:
> 
>  People relevant to projects for French locale do trace the borderline of 
> applicability wider 
> than do those people who are closely tied to Unicode‐related projects.

Could you give a concrete example or two of what these people mean by “wider 
borderline of applicability” that might generate their ethical dilemma?


Re: The Unicode Standard and ISO

2018-06-08 Thread Asmus Freytag via Unicode

  
  
On 6/8/2018 5:01 AM, Michael Everson via Unicode wrote:

> and achieving a fullscale merger with ISO/IEC 15897, after which the valid
> data stay hosted entirely in CLDR, and ISO/IEC 15897 would be its ISO mirror.

> I wonder if Mark Davis will be quick to agree with me when I say that
> ISO/IEC 15897 has no use and should be withdrawn

I don't know about Mark, but that would have been my position.

There's no value added in creating "mirrors" of something that is
successfully being developed and maintained under a different umbrella.

A./


Re: The Unicode Standard and ISO

2018-06-08 Thread Philippe Verdy via Unicode
2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️  wrote:
>
> > Mark
> >
> > On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> > unicode@unicode.org> wrote:
> >
> > > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > > Marcel Schneider via Unicode  wrote:
> > >
> > > > Thank you for confirming. All witnesses concur to invalidate the
> > > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > > After being invented in its actual form, sorting was standardized
> > > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > > Algorithm, the latter including practice‐oriented extra
> > > > features.
> > >
> > > The UCA contains features essential for respecting canonical
> > > equivalence.  ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?​
>
> Then remove the "no known language" from the bug list, or declare that
> you don't know SE Asian languages.
>
> The root problem is that the UCA cannot handle syllable by syllable
> comparisons; if the UCA could handle that, the correct collation of
> unambiguous true Lao would become simple.  The CLDR algorithm provides
> just enough memory to make Lao collation possible; however, ICU isn't
> fast enough to load a collation from customisation - it takes hours!
> One could probably do better if one added suffix contractions, but
> adding that capability might be a nightmare.


The way tailoring is designed in CLDR, using only data consumed by a generic
algorithm rather than a custom algorithm, is not the only way to collate Lao.
You can perfectly well add new custom algorithm primitives that use new
collation data rules inserted as "hooks" in UCA (which provides
several points at which this is possible, but UCA just makes these hooks act
as no-ops).

You can be much faster if you create a specific library for Lao that would
still be able to process the basic collation rules and then make more
advanced inferences based on larger cluster boundaries than just those
considered in the standard basic UCA; so it is perfectly possible to extend
it to cover more complex Lao syllables and various specific quirks (such as
hyphenation in the middle of clusters, as seen in some Indic scripts using
left matras).

Not everything has to be specified by UCA itself notably if it's specific
to a script (or sometimes only a single locale, i.e. a specific combination
of a script, language, orthographic convention, and stylistic convention
for some kinds of documents or presentations).
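[Editor's note: the contraction mechanism discussed in this exchange can be made concrete with a short sketch. The following pure-Python toy collator is illustrative only: the weights and the "ch" contraction (the traditional Spanish tailoring) are invented for the example, and this is neither ICU code nor an actual Lao tailoring.]

```python
# Toy sketch of contraction handling in a UCA-style collator.
# Weights and the "ch" contraction are invented for illustration.

CONTRACTIONS = {"ch": 3.5}               # "ch" sorts as a unit, after "c"
BASE = {"a": 1, "c": 3, "h": 8, "u": 21}

def sort_key(word, use_contractions=True):
    """Map a word to a tuple of primary weights, longest match first."""
    key, i = [], 0
    while i < len(word):
        two = word[i:i + 2]
        if use_contractions and two in CONTRACTIONS:
            key.append(CONTRACTIONS[two])   # consume the contraction as one unit
            i += 2
        else:
            key.append(BASE[word[i]])       # fall back to a single letter
            i += 1
    return tuple(key)

words = ["cha", "cu", "ca"]
print(sorted(words, key=sort_key))                        # ['ca', 'cu', 'cha']
print(sorted(words, key=lambda w: sort_key(w, False)))    # ['ca', 'cha', 'cu']
```

With the contraction active, "cha" jumps after "cu" because "ch" carries its own primary weight; this is the data-driven mechanism that a generic algorithm can consume, as opposed to the script-specific custom code discussed above.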


Re: The Unicode Standard and ISO

2018-06-08 Thread Richard Wordingham via Unicode
On Fri, 8 Jun 2018 13:40:21 +0200
Mark Davis ☕️  wrote:

> Mark
> 
> On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > Marcel Schneider via Unicode  wrote:
> >  
> > > Thank you for confirming. All witnesses concur to invalidate the
> > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > After being invented in its actual form, sorting was standardized
> > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > Algorithm, the latter including practice‐oriented extra
> > > features.  
> >
> > The UCA contains features essential for respecting canonical
> > equivalence.  ICU works hard to avoid the extra effort involved,
> > apparently even going to the extreme of implicitly declaring that
> > Vietnamese is not a human language.  
 
> A bit over the top, eh?​

Then remove the "no known language" from the bug list, or declare that
you don't know SE Asian languages.

The root problem is that the UCA cannot handle syllable by syllable
comparisons; if the UCA could handle that, the correct collation of
unambiguous true Lao would become simple.  The CLDR algorithm provides
just enough memory to make Lao collation possible; however, ICU isn't
fast enough to load a collation from customisation - it takes hours!
One could probably do better if one added suffix contractions, but
adding that capability might be a nightmare.

> I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868,
> which nicely outlines a proposal for dealing with a number of
> problems with Vietnamese.

It still includes a brute force work-around.

> We clearly don't support every sorting feature that various
> dictionaries and agencies come up with. Sometimes it is because we
> can't (yet) see a good way to do it:

>1. it might not be deterministic: many governmental standards or
> style sheets require "interesting" sorting, such as determining that
> "XI" is a roman numeral (not the president of China) and sorting as
> 11, or when "St." is meant to be Street *and* when meant to be Saint
> (St. Stephen's St.)

I believe the first is a character identity issue.  Some of us
see the difference between U+0058 LATIN CAPITAL LETTER X and the
discouraged U+2169 ROMAN NUMERAL TEN as more than just a round-tripping
difference.  For example, by hand, I write the 'V' in 'Henry V' with a
regnal number quite differently to 'Henry V.' where 'V' is short for a
name.

> > > Since then,
> > > these two standards are kept in synchrony uninterruptedly.  

> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR.  Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation. Indeed, I'm not even sure that the DUCET is a tailoring
> > of the root CLDR collation, even when restricted to assigned
> > characters.  Tailorings tend to have odd side effects; fortunately,
> > they rarely if ever matter. CLDR root is a rewrite with
> > modifications of DUCET; it has changes that are prohibited as
> > 'tailorings'! 

> ​CLDR does make some tailorings to the DUCET to create its root
> collation, ​notably adding special contractions of private use
> characters to allow for tailoring support and indexes [
> http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
> ]  plus the rearrangement of some characters (mostly punctuation and
> symbols) to allow runtime parametric reordering of groups of
> characters (eg to put numbers after letters) [
> http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
> ].

My main point is that for practical purposes (i.e. ICU), Unicode has
moved away from ISO/IEC 14651.  The difference is small.  I didn't say
that there weren't good reasons.

>- If there are other changes that are not well documented, or if
> you think those features are causing problems in some way, please
> file a ticket.

Well, I don't have to use DUCET, though I've found it easier for
unmaintainable tailorings.  I need to write code to apply
non-parametric LDML tailorings - ICU is, alas, ridiculously slow.  I
hope that's just a matter of optimisation balance between compiling a
tailoring and applying it.  Are there any published compliance tests
for non-parametric tailorings?  I'm not sure how one would check that an
alleged parametric reordering of numbers and letters applied to a
tailoring of DUCET was in accordance with the LDML definition, but I
don't think you want to expend money sorting that out. 

>- If there is a particular change that you think is not conformant
> to UCA, please also file that.

Sorry, I must have scanned the conformance requirements too quickly.  I
had got it into my head that someone had recklessly required that
tailorings be in accordance with LDML.  That constraint only applies
to parametric tailorings, so any properly structured unambiguously

Re: The Unicode Standard and ISO

2018-06-08 Thread Steven R. Loomis via Unicode
Marcel,

On Fri, Jun 8, 2018 at 6:52 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:
>
> What got me started is that even before I requested a submitter ID (and
> the reason why I requested one),
> "Characters | Category | Label | keycap" remained untranslated, i.e. its
> French translation was "keycap".
> When I proposed "cabochon", the present contributors kindly upvoted or
> proposed "touche" even before I
> launched a forum thread, and when I became aware, I changed my vote and
> posted the rationale on the forum,
> so the upvoting contributor kindly followed, so that now we stay united for
> "touche" rather than "keycap".
>


 But, it sounds like the CLDR process was successful in this case. Thank
you for contributing.


> Please note that I acknowledge everybody and don’t criticize anybody. It
> doesn’t require much imagination
> to figure out that when CLDR was set up, there were so few or even no
> French contributors that translating
> "keycap" either fell out of deadline or was overlooked or whatever, and
> later passed unnoticed. That is a
> tracer detecting that none of the people setting up the French translation
> of the Code Charts were ever on
> the CLDR project. Because if anybody of them had been active on CLDR, no
> English word would have been
> kept in use mistakenly for the French locale.
>

Actually, I think the particular data item you found is relatively new. The
first values entered for it in any language were May 18th of this year.
Were there votes for "keycap" earlier?
Rather than a tracer finding evidence of neglect, you are at the forefront
of progressing the translated data for French. Congratulations!

> French contributors are not "prevented from cooperating". Where do you
get this from? Who do you mean?

>
> Historic French contributors are ethically prevented from contributing to
> CLDR, because of a strong commitment to involve ISO/IEC,
> a notion that is very meaningful to Unicode. People relevant to projects
> for French locale do trace the borderline of applicability wider
> than do those people who are closely tied to Unicode‐related projects.


Which contributors specifically are prevented?


> > There were not "many attempts" at a merger, and Unicode didn't "refuse"
> anything. Who do you think "attempted", and when?
>
> An influential person consistently campaigned for a merger of CLDR and
> ISO/IEC 15897, but that never succeeded. It’s unlikely to be ignored.


Which person?

> Albeit given the state of ISO/IEC 15897, there was nothing such a merger
> would have contributed anyway.
>
> I’ve taken a glance at the data of ISO/IEC 15897 and cannot figure out that
> there is nothing to pick from. At least they won’t be disposed to
> sell you "keycap" as a French term or as being in any use in that target
> locale. And anyhow, the gesture would be appreciated as a piece
> of good diplomacy. Hopefully a lightweight proceeding could end up in that
> data being transferred to CLDR, and this being cited as sole
> normative reference in ISO/IEC 15897. As a result, everybody’s happy.
>

 The registry for ISO/IEC 15897 has neither data for French, nor structure
that would translate the term "Characters | Category | Label | keycap". So
there would be nothing to merge with there.

So, historically, CLDR began not as a part of Unicode, but as part of Li18nx
under the Free Standards Group. See the bottom of the page
http://cldr.unicode.org/index/acknowledgments "The founding members of the
workgroup were IBM, Sun and OpenOffice.org".  What we were trying to do was
to provide internationalized content for Linux, and also, to resolve the
then-disparity between locale data across platforms. Locale data was very
divergent between platforms - spelling and word choice changes, etc.
Comparisons were done and a Common locale data repository  (with its
attendant XML formats) emerged. That's the C in CLDR. Seed data came from
IBM’s ICIR which dates many decades before 15897 (example
http://www.computinghistory.org.uk/det/13342/IBM-National-Language-Support-Reference-Manual-Volume-2/
- 4th edition published in 1994.) We contributed 100 locales to glibc as
well.

Where there is opportunity for productive sync and merging with is glibc.
We have had some discussions, but more needs to be done- especially a lot
of tooling work. Currently many bug reports are duplicated between glibc
and cldr, a sort of manual synchronization. Help wanted here.

Steven


Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 08:50:28 -0400, Tom Gewecke via Unicode wrote:
> 
> 
> > On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode  wrote:
> > 
> > What bothered me ... is that the registration of the French locale in CLDR 
> > is 
> > still surprisingly incomplete
> 
> Could you provide an example or two?
> 

What got me started is that "Characters | Category | Label | keycap" remained 
untranslated, i.e. its French translation was "keycap". 

A number of keyword translations are missing or wrong. I can tell that all 
current contributors are working hard to fix the issues.
I can imagine that it’s for lack of time in front of the huge mass of data, or 
from feeling so alone (only three corporate contributors, 
no liaisons or NGOs). No wonder the official French translators are all 
shunning the job (reportedly, not my own inference).

Marcel



Re: The Unicode Standard and ISO

2018-06-08 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 13:06:18 +0200, Mark Davis ☕️ via Unicode wrote:
> 
> Where are you getting your "facts"? Among many unsubstantiated or ambiguous 
> claims in that very long sentence:
>
> > "French locale in CLDR is still surprisingly incomplete". 
>
> For each release, the data collected for the French locale is complete to the 
> bar we have set for Level=Modern.

What got me started is that even before I requested a submitter ID (and the 
reason why I requested one), 
"Characters | Category | Label | keycap" remained untranslated, i.e. its French 
translation was "keycap".
When I proposed "cabochon", the present contributors kindly upvoted or proposed 
"touche" even before I 
launched a forum thread, and when I became aware, I changed my vote and posted 
the rationale on the forum, 
so the upvoting contributor kindly followed, so that now we stay united for 
"touche" rather than "keycap".

Please note that I acknowledge everybody and don’t criticize anybody. It 
doesn’t require much imagination 
to figure out that when CLDR was set up, there were so few or even no French 
contributors that translating 
"keycap" either fell out of deadline or was overlooked or whatever, and later 
passed unnoticed. That is a 
tracer detecting that none of the people setting up the French translation of 
the Code Charts were ever on 
the CLDR project. Because if anybody of them had been active on CLDR, no 
English word would have been 
kept in use mistakenly for the French locale.

Beyond what everybody on this List is able to decrypt on his or her own, I’m 
not in a position to disclose 
any further personal information, for witness protection’s sake.

> What you may mean is that CLDR doesn't support a structure that you think it 
> should.
> For that, you have to make a compelling case that the structure you propose 
> is worth it, worth diverting people from other priorities.

Thank you, that is not a problem and may be resolved after filing a ticket, 
which would be done for a later release, given that 
top priority tasks require a potentially huge amount of work. First NBSP and 
NNBSP need to be added to the French charset (see
http://unicode.org/cldr/trac/ticket/11120
). Adding centuries to Date (with French short form "s.") is of interest 
for any locale, but irrelevant to everyday business practice.

>
> French contributors are not "prevented from cooperating". Where do you get 
> this from? Who do you mean?

Historic French contributors are ethically prevented from contributing to CLDR, 
because of a strong commitment to involve ISO/IEC, 
a notion that is very meaningful to Unicode. People relevant to projects for 
French locale do trace the borderline of applicability wider 
than do those people who are closely tied to Unicode‐related projects.

>
> We have many French contribute data over time.

When finding the word "keycap" as a French translation of "keycap" in my copy 
of CLDR data at home, I wanted to know who contributed 
that data. I was told that when the survey is open, I’ll see who is contributing. I 
won’t blame those who are helping resolve the issue now.

> Now, it works better when people engage under the umbrella of an 
> organization, but even there that doesn't have to be a company;
> we have liaison relationships with government agencies and NGOs.

That’s fine. But even as a guest I’m well received, and anyhow the point is to 
bring the arguments. 

My concern is that starting with a good translation from scratch is more 
efficient than attempting to correct the same error(s) 
across multiple instances via the survey tool, that seems to be designed to fix 
small errors rather than to redesign entire parts 
of the scheme. 

>
> There were not "many attempts" at a merger, and Unicode didn't "refuse" 
> anything. Who do you think "attempted", and when?

An influential person consistently campaigned for a merger of CLDR and ISO/IEC 
15897, but that never succeeded. It’s unlikely to be ignored.

>
> Albeit given the state of ISO/IEC 15897, there was nothing such a merger 
> would have contributed anyway.

I’ve taken a glance at the data of ISO/IEC 15897 and cannot figure out that 
there is nothing to pick from. At least they won’t be disposed to 
sell you "keycap" as a French term or as being in any use in that target 
locale. And anyhow, the gesture would be appreciated as a piece 
of good diplomacy. Hopefully a lightweight proceeding could end up in that data 
being transferred to CLDR, and this being cited as sole 
normative reference in ISO/IEC 15897. As a result, everybody’s happy.

> BTW, your use of the term "refuse" might be a language issue. I don't 
> "refuse" to respond
> to the widow of a Nigerian Prince who wants to give me $1M. Since I don't 
> think it is worth my time,
> or am not willing to upfront the low, low fee of $10K, I might "ignore" the 
> email, or "not respond" to it.
> Or I might "decline" it with a no-thanks or not-interested response. But none 
> of that is to "refuse" it.

Re: The Unicode Standard and ISO

2018-06-08 Thread Tom Gewecke via Unicode


> On Jun 7, 2018, at 11:32 PM, Marcel Schneider via Unicode 
>  wrote:
> 
> What bothered me ... is that the registration of the French locale in CLDR is 
> still surprisingly incomplete

Could you provide an example or two?


Re: The Unicode Standard and ISO

2018-06-08 Thread Andrew West via Unicode
On 8 June 2018 at 13:01, Michael Everson via Unicode
 wrote:
>
> I wonder if Mark Davis will be quick to agree with me  when I say that 
> ISO/IEC 15897 has no use and should be withdrawn.

It was reviewed and confirmed in 2017, so the next systematic review
won't be until 2022. And as the standard is now under SC35, national
committees mirroring SC2 may well overlook (or be unable to provide
feedback to) the systematic review when it next comes around. I agree
that ISO/IEC 15897 has no use, and should be withdrawn.

Andrew



Re: The Unicode Standard and ISO

2018-06-08 Thread Michael Everson via Unicode
On 8 Jun 2018, at 04:32, Marcel Schneider via Unicode  
wrote:

> the registration of the French locale in CLDR is still surprisingly 
> incomplete despite the meritorious efforts made by the actual contributors

Nothing prevents people from working to complete the French locale in CLDR. 
Synchronization with an unused ISO standard is not necessary to do that. 

Michael Everson


Re: The Unicode Standard and ISO

2018-06-08 Thread Michael Everson via Unicode
On 7 Jun 2018, at 20:13, Marcel Schneider via Unicode  
wrote:

> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>> 
>> It would be great if mutual synchronization were considered to be of benefit.
>> Some of us in SC2 are not happy that the Unicode Consortium has published 
>> characters
>> which are still under Technical ballot. And this did not happen only once. 
> 
> I’m not happy catching up this thread out of time, the less as it ultimately 
> brings me where I’ve started 
> in 2014/2015: to the wrong character names that the ISO/IEC 10646 merger 
> infiltrated into Unicode.

Many things have more than one name. The only truly bad misnomers from that 
period were related to a mapping error, namely in the treatment of Latvian 
characters, which are called CEDILLA rather than COMMA BELOW.

> This is the very thing I did not vent in my first reply. From my point of 
> view, this misfortune would be 
> reason enough for Unicode not to seek further cooperation with ISO/IEC.

This is absolutely NOT what we want. What we want is for the two parties to 
remember that industrial concerns and public concerns work best together. 

> But I remember the many voices raising on this List to tell me that this is 
> all over and forgiven.

I think you are digging up an old grudge that nobody thinks about any longer. 

> Therefore I’m confident that the Consortium will have the mindfulness to 
> complete the ISO/IEC JTC 1 
> partnership by publicly assuming synchronization with ISO/IEC 14651,

There is no trouble with ISO/IEC 14651. 

> and achieving a fullscale merger with ISO/IEC 15897, after which the valid 
> data stay hosted entirely in CLDR, and ISO/IEC 15897 would be its ISO mirror. 

I wonder if Mark Davis will be quick to agree with me  when I say that ISO/IEC 
15897 has no use and should be withdrawn. 

Michael Everson


Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode
Mark

On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
>
> > Thank you for confirming. All witnesses concur to invalidate the
> > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > After being invented in its actual form, sorting was standardized
> > simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> > the latter including practice‐oriented extra features.
>
> The UCA contains features essential for respecting canonical
> equivalence.  ICU works hard to avoid the extra effort involved,
> apparently even going to the extreme of implicitly declaring that
> Vietnamese is not a human language.


A bit over the top, eh?​


> (Some contractions are not
> supported by ICU!)


I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which
nicely outlines a proposal for dealing with a number of problems with
Vietnamese.

We clearly don't support every sorting feature that various dictionaries
and agencies come up with. Sometimes it is because we can't (yet) see a
good way to do it:

   1. it might not be deterministic: many governmental standards or style
   sheets require "interesting" sorting, such as determining that "XI" is a
   roman numeral (not the president of China) and sorting as 11, or when "St."
   is meant to be Street *and* when meant to be Saint (St. Stephen's St.)
   2. the prospective cost in memory, code complexity, or performance, or
   the time necessary to figure out to do complex requirements, doesn't seem
   to warrant adding it at this point​. Now, if you or others are interested
   in proposing specific patches to address certain issues, then you can
   propose that. Best to make a proposal (ticket) before doing the work,
   because if the solution is very intricate, even the time necessary to
   evaluate the patch can be too much to fit into the schedule. For that
   reason, it is best to break up such tickets into small, tractable pieces.

The synchronisation is manifest in the DUCET
> collation, which seems to make the effort to ensure that some canonical
> equivalent will sort the same way under ISO/IEC 14651.
>
> > Since then,
> > these two standards are kept in synchrony uninterruptedly.
>
> But the consortium has formally dropped the commitment to DUCET in
> CLDR.  Even when restricted to strings of assigned characters, the CLDR
> and ICU no longer make the effort to support the DUCET collation.
> Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
> collation, even when restricted to assigned characters.  Tailorings
> tend to have odd side effects; fortunately, they rarely if ever matter.
> CLDR root is a rewrite with modifications of DUCET; it has changes that
> are prohibited as 'tailorings'!
>

​CLDR does make some tailorings to the DUCET to create its root collation,
​notably adding special contractions of private use characters to allow for
tailoring support and indexes [
http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
]  plus the rearrangement of some characters (mostly punctuation and
symbols) to allow runtime parametric reordering of groups of characters (eg
to put numbers after letters) [
http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
].

   - If there are other changes that are not well documented, or if you
   think those features are causing problems in some way, please file a
   ticket.
   - If there is a particular change that you think is not conformant to
   UCA, please also file that.
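[Editor's note: the runtime parametric reordering described above can be sketched in a few lines. This is a toy illustration only: each character's primary key is (group rank, within-group weight), and reordering merely permutes the group ranks at query time. The groups and weights are invented for the example; real CLDR reordering operates on the reorder groups defined in the FractionalUCA data.]

```python
# Toy sketch of CLDR-style parametric reordering: reordering groups of
# characters (e.g. digits after letters) without rebuilding collation data.

def make_key(group_rank):
    """Build a sort-key function for a given ranking of the two groups."""
    def group_of(ch):
        return "digit" if ch.isdigit() else "letter"
    def key(s):
        # Primary key per character: (rank of its group, its own weight)
        return [(group_rank[group_of(ch)], ch) for ch in s]
    return key

items = ["a1", "1a", "b", "2"]
digits_first = sorted(items, key=make_key({"digit": 0, "letter": 1}))
letters_first = sorted(items, key=make_key({"digit": 1, "letter": 0}))
print(digits_first)    # ['1a', '2', 'a1', 'b']
print(letters_first)   # ['a1', 'b', '1a', '2']
```

Only the small rank table changes between the two orderings; the per-character weights stay fixed, which is what makes the reordering cheap at runtime.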


> Richard.
>
>


Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode
Where are you getting your "facts"? Among many unsubstantiated or ambiguous
claims in that very long sentence:

   1. "French locale in CLDR is still surprisingly incomplete".
      1. For each release, the data collected for the French locale is
         complete to the bar we have set for Level=Modern.
      2. What you may mean is that CLDR doesn't support a structure that
         you think it should. For that, you have to make a compelling case
         that the structure you propose is worth it, worth diverting people
         from other priorities.
   2. French contributors are not "prevented from cooperating". Where do
      you get this from? Who do you mean?
      1. We have many French contribute data over time. Now, it works
         better when people engage under the umbrella of an organization,
         but even there that doesn't have to be a company; we have liaison
         relationships with government agencies and NGOs.
   3. There were not "many attempts" at a merger, and Unicode didn't
      "refuse" anything. Who do you think "attempted", and when?
      1. Albeit given the state of ISO/IEC 15897, there was nothing such a
         merger would have contributed anyway.
      2. BTW, your use of the term "refuse" might be a language issue. I
         don't "refuse" to respond to the widow of a Nigerian Prince who
         wants to give me $1M. Since I don't think it is worth my time, or
         am not willing to upfront the low, low fee of $10K, I might
         "ignore" the email, or "not respond" to it. Or I might "decline"
         it with a no-thanks or not-interested response. But none of that
         is to "refuse" it.



Mark

On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> >
> > I cannot but fully agree with Mark and Michael.
> >
> > Sincerely
> >
>
> Thank you for confirming. All witnesses concur to invalidate the statement
> about
> uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in
> its
> actual form, sorting was standardized simultaneously in ISO/IEC 14651 and
> in
> Unicode Collation Algorithm, the latter including practice‐oriented extra
> features.
> Since then, these two standards are kept in synchrony uninterruptedly.
>
> Getting people to correct the overall response was not really my initial
> concern, however. What bothered me before I learned that Unicode refuses to
> cooperate with ISO/IEC JTC1 SC22 is that the registration of the French
> locale in CLDR is still surprisingly incomplete despite the meritorious
> efforts of the current contributors; and then, after some investigation,
> that the main part of the potential French contributors are prevented from
> cooperating because Unicode refuses to cooperate with ISO/IEC on locale
> data even though ISO/IEC 15897 predates CLDR, reportedly after many
> attempts to merge the two standards, all unsuccessful, with no public
> explanation or amicable agreement to dispel the impression of an
> unconcerned rebuff.
>
> Best regards,
>
> Marcel
>
>


Re: The Unicode Standard and ISO

2018-06-08 Thread Richard Wordingham via Unicode
On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
Marcel Schneider via Unicode  wrote:

> Thank you for confirming. All witnesses concur to invalidate the
> statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> After being invented in its actual form, sorting was standardized
> simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> the latter including practice‐oriented extra features. 

The UCA contains features essential for respecting canonical
equivalence.  ICU works hard to avoid the extra effort involved,
apparently even going to the extreme of implicitly declaring that
Vietnamese is not a human language. (Some contractions are not
supported by ICU!)  The synchronisation is manifest in the DUCET
collation, which seems to make the effort to ensure that some canonical
equivalent will sort the same way under ISO/IEC 14651.
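Richard's point about canonical equivalence can be made concrete with the
standard library alone: the two spellings below differ at the code point
level, yet any collation conforming to UCA (and hence to ISO/IEC 14651) must
sort them identically. Note that Python's `unicodedata` is used here only to
exhibit the equivalence; it does not implement the UCA itself.

```python
import unicodedata

# "Việt" written with the precomposed letter U+1EC7 (ệ)...
composed = "Vi\u1ec7t"
# ...and the same word in fully decomposed (NFD) form:
decomposed = unicodedata.normalize("NFD", composed)

# The code point sequences differ (NFD orders the combining marks
# dot-below U+0323 before circumflex U+0302, by canonical combining class)...
assert composed != decomposed
assert [hex(ord(c)) for c in decomposed] == [
    "0x56", "0x69", "0x65", "0x323", "0x302", "0x74"
]
# ...yet the strings are canonically equivalent: they normalize alike.
assert unicodedata.normalize("NFC", decomposed) == composed
```

A conformant collation must assign both strings the same sort key, which is
exactly the "extra effort" Richard says implementations are tempted to skip.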

> Since then,
> these two standards are kept in synchrony uninterruptedly.

But the consortium has formally dropped the commitment to DUCET in
CLDR.  Even when restricted to strings of assigned characters, the CLDR
and ICU no longer make the effort to support the DUCET collation.
Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
collation, even when restricted to assigned characters.  Tailorings
tend to have odd side effects; fortunately, they rarely if ever matter.
CLDR root is a rewrite with modifications of DUCET; it has changes that
are prohibited as 'tailorings'!

Richard.



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Fri, 8 Jun 2018 00:43:04 +0200, Philippe Verdy via Unicode wrote:
[cited mail]
>
> The "normative names" are in fact normative only as a forward reference
> to the ISO/IEC repertoire (because it insists that these names are an
> essential part of the stable encoding policy, which was then integrated
> into the Unicode stability rules, so that the normative reference remains
> stable as well). Beside this, Unicode has other more useful properties.
> People don't care at all about these names.

Effectively we have learned to live even with those that are uselessly 
misleading and had been pushed through against better proposals made on the 
Unicode side, particularly the wrong left/right attributes. Unicode has worked 
hard to palliate these misnomers by introducing the Bidi_Paired_Bracket and 
Bidi_Paired_Bracket_Type (Open, Close, None) properties, and by specifying in 
TUS that, beside a few exceptions, LEFT and RIGHT in names of paired 
punctuation are to be read as OPENING and CLOSING, respectively.
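The misnomer is easy to demonstrate from Python's bundled copy of the UCD (the
Bidi_Paired_Bracket properties themselves are not exposed by `unicodedata`,
but the normative names and the Bidi_Mirrored property are):

```python
import unicodedata

# The normative name says LEFT, although the character's function is "opening":
assert unicodedata.name("(") == "LEFT PARENTHESIS"
# In right-to-left context it is mirrored (Bidi_Mirrored=Y), which is why TUS
# glosses LEFT/RIGHT in paired-punctuation names as OPENING/CLOSING.
assert unicodedata.mirrored("(") == 1
assert unicodedata.mirrored(")") == 1
```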

> The character properties and the related algorithms that use them (and even
> the representative glyph, even if it's not stabilized) are much more
> important (and ISO/IEC 10646 does not do anything to solve the real encoding
> issues and needed properties for correct processing). Unicode is more based
> on commonly used practices and allows experimentation and progressive
> enhancement without having to break the agreed ISO/IEC normative properties.
> The position of Unicode is more pragmatic, and is much more open to a lot of
> contributors than the small ISO/IEC subcommittees with in fact very few
> active members, but it's still an interesting counter-power that allows
> governments to choose where it is more useful to contribute and have
> influence when the industry may have different needs and practices not
> following the government recommendations adopted at ISO.

Now it becomes clear to me that this opportunity for governmental action is 
exactly what could be useful when it comes to fixing the textual appearance of 
national user interfaces, and that is exactly why not federating communities 
around CLDR, and not attempting to make efforts converge, is so 
counter‐productive.

Thanks for getting this point out.

Best regards,

Marcel



RE: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> 
> I cannot but fully agree with Mark and Michael.
> 
> Sincerely
> 

Thank you for confirming. All witnesses concur to invalidate the statement 
about 
uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being invented in its 
actual form, sorting was standardized simultaneously in ISO/IEC 14651 and in 
Unicode Collation Algorithm, the latter including practice‐oriented extra 
features. 
Since then, these two standards are kept in synchrony uninterruptedly.

Getting people to correct the overall response was not really my initial 
concern, however. What bothered me before I learned that Unicode refuses to 
cooperate with ISO/IEC JTC1 SC22 is that the registration of the French locale 
in CLDR is still surprisingly incomplete despite the meritorious efforts of 
the current contributors; and then, after some investigation, that the main 
part of the potential French contributors are prevented from cooperating 
because Unicode refuses to cooperate with ISO/IEC on locale data even though 
ISO/IEC 15897 predates CLDR, reportedly after many attempts to merge the two 
standards, all unsuccessful, with no public explanation or amicable agreement 
to dispel the impression of an unconcerned rebuff.

Best regards,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Philippe Verdy via Unicode
2018-06-07 21:13 GMT+02:00 Marcel Schneider via Unicode :

> On Thu, 17 May 2018 22:26:15 +, Peter Constable via Unicode wrote:
> […]
> > Hence, from an ISO perspective, ISO 10646 is the only standard for which
> on-going
> > synchronization with Unicode is needed or relevant.
>
> This point of view is fueled by the Unicode Standard being traditionally
> thought of as a mere character set,
> regardless of all efforts—lastly by first responder Asmus Freytag
> himself—to widen the conception.
>
> On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
> >
> > It would be great if mutual synchronization were considered to be of
> benefit.
> > Some of us in SC2 are not happy that the Unicode Consortium has
> published characters
> > which are still under Technical ballot. And this did not happen only
> once.
>
> I'm not happy to be catching up on this thread so late, the less so as it
> ultimately brings me back to where I started in 2014/2015: to the wrong
> character names that the ISO/IEC 10646 merger infiltrated into Unicode.
> This is the very thing I did not vent in my first reply. From my point of
> view, this misfortune would be
> reason enough for Unicode not to seek further cooperation with ISO/IEC.
>

The "normative names" are in fact normative only as a forward reference to
the ISO/IEC repertoire (because it insists that these names are an essential
part of the stable encoding policy, which was then integrated into the
Unicode stability rules, so that the normative reference remains stable as
well). Beside this, Unicode has other more useful properties. People don't
care at all about these names. The character properties and the related
algorithms that use them (and even the representative glyph, even if it's not
stabilized) are much more important (and ISO/IEC 10646 does not do anything
to solve the real encoding issues and needed properties for correct
processing). Unicode is more based on commonly used practices and allows
experimentation and progressive enhancement without having to break the
agreed ISO/IEC normative properties. The position of Unicode is more
pragmatic, and is much more open to a lot of contributors than the small
ISO/IEC subcommittees with in fact very few active members, but it's still an
interesting counter-power that allows governments to choose where it is more
useful to contribute and have influence when the industry may have different
needs and practices not following the government recommendations adopted at
ISO.


RE: The Unicode Standard and ISO

2018-06-07 Thread Erkki I. Kolehmainen via Unicode
I cannot but fully agree with Mark and Michael.

Sincerely

Erkki I. Kolehmainen
Mannerheimintie 75 B 37, 00270 Helsinki, Finland
Mob: +358 400 825 943 

-Original Message-
From: Unicode  On Behalf Of Michael Everson via 
Unicode
Sent: Thursday, 7 June 2018 16:29
To: unicode Unicode Discussion 
Subject: Re: The Unicode Standard and ISO

On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode  wrote:
> 
> A few facts. 
> 
>> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above statement is 
> inaccurate.

Mark is right. 

>> > ... For another part it [sync with ISO/IEC 15897] failed because the 
>> > Consortium refused to cooperate, despite repeated proposals for a 
>> > merger of both instances.
> 
> I recall no serious proposals for that. 

Nor do I.

> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
> considerable costs of maintaining synchrony. Completely inadequate structure 
> for modern system requirement, no particular industry support, and scant 
> content: see Wikipedia for "The registry has not been updated since December 
> 2001”.)

Mark is right.

Michael Everson




Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 17 May 2018 22:26:15 +, Peter Constable via Unicode wrote:
[…]
> Hence, from an ISO perspective, ISO 10646 is the only standard for which 
> on-going
> synchronization with Unicode is needed or relevant. 

This point of view is fueled by the Unicode Standard being traditionally 
thought of as a mere character set, 
regardless of all efforts—lastly by first responder Asmus Freytag himself—to 
widen the conception.

On Fri, 18 May 2018 00:29:36 +0100, Michael Everson via Unicode responded:
>
> It would be great if mutual synchronization were considered to be of benefit.
> Some of us in SC2 are not happy that the Unicode Consortium has published 
> characters
> which are still under Technical ballot. And this did not happen only once. 

I'm not happy to be catching up on this thread so late, the less so as it 
ultimately brings me back to where I started in 2014/2015: to the wrong 
character names that the ISO/IEC 10646 merger infiltrated into Unicode.
This is the very thing I did not vent in my first reply. From my point of view, 
this misfortune would be 
reason enough for Unicode not to seek further cooperation with ISO/IEC.

But I remember the many voices raised on this List to tell me that this is all 
over and forgiven. Therefore I'm confident that the Consortium will have the 
mindfulness to complete the ISO/IEC JTC 1 partnership by publicly assuming 
synchronization with ISO/IEC 14651, and by achieving a full-scale merger with 
ISO/IEC 15897, after which the valid data would stay hosted entirely in CLDR, 
and ISO/IEC 15897 would be its ISO mirror.

That is a matter of smart diplomacy, which Unicode may again prove to be great 
at.

Please consider making this move.

Thanks,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 7 Jun 2018 15:20:29 +0200, Mark Davis ☕️ via Unicode wrote:
> 
> A few facts. 
>
> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
>
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the
> synchronization level in more detail, but the above statement is inaccurate.
>
> > ... For another part it [sync with ISO/IEC 15897] failed because the 
> > Consortium refused to cooperate, despite repeated proposals for a merger 
> > of both instances.
> 
> I recall no serious proposals for that. 
> 
> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought
> no value to the table. Certainly nothing to outweigh the considerable costs 
> of maintaining synchrony.
> Completely inadequate structure for modern system requirement, no particular 
> industry support, and
> scant content: see Wikipedia for "The registry has not been updated since 
> December 2001".)



Thank you for the correction regarding the Unicode ISO/IEC 14651 synchrony; 
indeed, while on

http://www.unicode.org/reports/tr10/#Synch_ISO14651

we can read that “This relationship between the two standards is similar to 
that maintained between the Unicode Standard and ISO/IEC 10646[,]” yet, 
confusingly, there seems to be no related FAQ. Even more
confusingly, a straightforward question like “I was wondering which ISO 
standards other than ISO 10646 
specify the same things as the Unicode Standard” remains ultimately unanswered. 

The reason might be that the “and of those, which ones are actively kept in 
sync” part is really best 
answered by “none.” In fact, while UCA is synched with ISO/IEC 14651, the 
reverse statement is 
reportedly false. Hence, UCA would be what is called an implementation of 
ISO/IEC 14651.

Nevertheless, UAX #10 refers to “The synchronized version of ISO/IEC 14651[,]” 
and mentions a 
“common tool[.]” 

Hence one simple question: Why does the fact that the Unicode-ISO synchrony 
encompasses *two* standards remain untold in the first place?


As for ISO/IEC 15897, it would certainly be a piece of good diplomacy for 
Unicode to pick the usable data from the existing set; ISO/IEC 15897 would 
then be in a position to cite CLDR as a normative reference, so that all 
potential contributors are redirected and may feel free to contribute to 
CLDR.

And it would be nice if Unicode did not forget to add a FAQ entry about the 
topic, please.

Thanks,

Marcel



Re: The Unicode Standard and ISO

2018-06-07 Thread Michael Everson via Unicode
On 7 Jun 2018, at 14:20, Mark Davis ☕️ via Unicode  wrote:
> 
> A few facts. 
> 
>> > ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.
> 
> ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could 
> speak to the synchronization level in more detail, but the above statement is 
> inaccurate.

Mark is right. 

>> > ... For another part it [sync with ISO/IEC 15897] failed because the 
>> > Consortium refused to cooperate, despite repeated proposals for a 
>> > merger of both instances.
> 
> I recall no serious proposals for that. 

Nor do I.

> (And in any event — very unlike the synchrony with 10646 and 14651 — ISO 
> 15897 brought no value to the table. Certainly nothing to outweigh the 
> considerable costs of maintaining synchrony. Completely inadequate structure 
> for modern system requirement, no particular industry support, and scant 
> content: see Wikipedia for "The registry has not been updated since December 
> 2001”.)

Mark is right.

Michael Everson


Re: The Unicode Standard and ISO

2018-06-07 Thread Mark Davis ☕️ via Unicode
A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could
speak to the synchronization level in more detail, but the above statement
is inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the
> Consortium refused to cooperate, despite repeated proposals for a merger
> of both instances.

I recall no serious proposals for that.

(And in any event — very unlike the synchrony with 10646 and 14651 — ISO 15897
brought no value to the table. Certainly nothing to outweigh the
considerable costs of maintaining synchrony. Completely inadequate
structure for modern system requirement, no particular industry support,
and scant content: see Wikipedia for "The registry has not been updated
since December 2001".)

Mark

Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> >
> > On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > > Hello,
> > >
> > > There are several mentions of synchronization with related standards in
> > > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > > https://www.unicode.org/faq/unicode_iso.html. However, all such
> mentions
> > > never mention anything other than ISO 10646.
> >
> > Because that is the standard for which there is an explicit
> understanding by all involved
> > relating to synchronization. There have been occasionally some
> challenging differences
> > in the process and procedures, but generally the synchronization is
> being maintained,
> > something that's helped by the fact that so many people are active in
> both arenas.
>
> Perhaps the cause-effect relationship is somewhat unclear. I think that
> many people being
> active in both arenas is helped by the fact that there is a strong will to
> maintain synching.
>
> If there were similar policies notably for ISO/IEC 14651 (collation) and
> ISO/IEC 15897
> (locale data), ISO/IEC 10646 would be far from standing alone in the field
> of
> Unicode-ISO/IEC cooperation.
>
> >
> > There are really no other standards where the same is true to the same
> extent.
> > >
> > > I was wondering which ISO standards other than ISO 10646 specify the
> > > same things as the Unicode Standard, and of those, which ones are
> > > actively kept in sync. This would be of importance for standardization
> > > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > > ISO standards is generally preferred in ISO standards.
> > >
> > One of the areas the Unicode Standard differs from ISO 10646 is that its
> conception
> > of a character's identity implicitly contains that character's
> properties - and those are
> > standardized as well and alongside of just name and serial number.
>
> This is probably why, to date, ISO/IEC 10646 features character properties
> by including
> normative references to the Unicode Standard, Standard Annexes, and the
> UCD.
> Bidi-mirroring e.g. is part of ISO/IEC 10646 that specifies in clause 15.1:
>
> “[…] The list of these characters is determined by having the
> ‘Bidi_Mirrored’ property
> set to ‘Y’ in the Unicode Standard. These values shall be determined
> according to
> the Unicode Standard Bidi Mirrored property (see Clause 2).”
>
> >
> > Many of these properties have associated with them algorithms, e.g. the
> bidi algorithm,
> > that are an essential element of data interchange: if you don't know
> which order in
> > the backing store is expected by the recipient to produce a certain
> display order, you
> > cannot correctly prepare your data.
> >
> > There is one area where standardization in ISO relates to work in
> Unicode that I can
> > think of, and that is sorting.
>
> Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the
> bibliography).
> The reverse relationship is irrelevant and would be unfair, given that the
> Consortium
> refused till now to synchronize UCA and ISO/IEC 14651.
>
> Here is a need for action.
>
> > However, sorting, beyond the underlying framework,
> > ultimately relates to languages, and language-specific data is now
> housed in CLDR.
> >
> > Early attempts by ISO to standardize a similar framework for locale data
> failed, in
> > part because the framework alone isn't the interesting challenge for a
> repository,
> > instead it is the collection, vetting and management of the data.
>
> For another part it failed because the Consortium refused to cooperate,
> despite repeated proposals for a merger of both instances.
>
> >
> > The reality is that the ISO model and its organizational structures are
> > not well suited to the needs of many important areas where some form of
> > standardization is needed. That's why we have organizations like IETF,
> > W3C, Unicode etc.
> >
> > Duplicating all or even part of their effort inside ISO really serves
> > nobody's purpose.

Re: The Unicode Standard and ISO

2018-06-07 Thread Marcel Schneider via Unicode
On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> 
> On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > Hello,
> >
> > There are several mentions of synchronization with related standards in
> > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
> > never mention anything other than ISO 10646.
> 
> Because that is the standard for which there is an explicit understanding by 
> all involved
> relating to synchronization. There have been occasionally some challenging 
> differences
> in the process and procedures, but generally the synchronization is being 
> maintained,
> something that's helped by the fact that so many people are active in both 
> arenas.

Perhaps the cause-effect relationship is somewhat unclear. I think that many 
people being 
active in both arenas is helped by the fact that there is a strong will to 
maintain synching.

If there were similar policies notably for ISO/IEC 14651 (collation) and 
ISO/IEC 15897 
(locale data), ISO/IEC 10646 would be far from standing alone in the field of 
Unicode-ISO/IEC cooperation.

> 
> There are really no other standards where the same is true to the same extent.
> >
> > I was wondering which ISO standards other than ISO 10646 specify the
> > same things as the Unicode Standard, and of those, which ones are
> > actively kept in sync. This would be of importance for standardization
> > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > ISO standards is generally preferred in ISO standards.
> >
> One of the areas the Unicode Standard differs from ISO 10646 is that its 
> conception
> of a character's identity implicitly contains that character's properties - 
> and those are
> standardized as well and alongside of just name and serial number.

This is probably why, to date, ISO/IEC 10646 features character properties by 
including 
normative references to the Unicode Standard, Standard Annexes, and the UCD.
Bidi-mirroring e.g. is part of ISO/IEC 10646 that specifies in clause 15.1:

“[…] The list of these characters is determined by having the ‘Bidi_Mirrored’ 
property 
set to ‘Y’ in the Unicode Standard. These values shall be determined according 
to 
the Unicode Standard Bidi Mirrored property (see Clause 2).”
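The property referenced in the clause quoted above is exposed by Python's
`unicodedata` module (which ships a copy of the UCD), so the normative
dependency can be checked directly:

```python
import unicodedata

# unicodedata.mirrored() reports the UCD Bidi_Mirrored property that
# ISO/IEC 10646 clause 15.1 normatively references.
assert unicodedata.mirrored("(") == 1   # Bidi_Mirrored=Y
assert unicodedata.mirrored("<") == 1   # Bidi_Mirrored=Y
assert unicodedata.mirrored("A") == 0   # Bidi_Mirrored=N
```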

> 
> Many of these properties have associated with them algorithms, e.g. the bidi 
> algorithm,
> that are an essential element of data interchange: if you don't know which 
> order in
> the backing store is expected by the recipient to produce a certain display 
> order, you
> cannot correctly prepare your data.
> 
> There is one area where standardization in ISO relates to work in Unicode 
> that I can
> think of, and that is sorting.

Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the 
bibliography).
The reverse relationship is irrelevant and would be unfair, given that the 
Consortium
refused till now to synchronize UCA and ISO/IEC 14651.

Here is a need for action.

> However, sorting, beyond the underlying framework,
> ultimately relates to languages, and language-specific data is now housed in 
> CLDR.
> 
> Early attempts by ISO to standardize a similar framework for locale data 
> failed, in
> part because the framework alone isn't the interesting challenge for a 
> repository,
> instead it is the collection, vetting and management of the data.

For another part it failed because the Consortium refused to cooperate, 
despite repeated proposals for a merger of both instances.

> 
> The reality is that the ISO model and its organizational structures are not 
> well suited to the needs of many important areas where some form of 
> standardization is needed. That's why we have organizations like IETF, W3C, 
> Unicode etc.
> 
> Duplicating all or even part of their effort inside ISO really serves 
> nobody's purpose.

An undesirable side-effect of not merging Unicode with ISO/IEC 15897 (locale 
data) is 
to divert many competent contributors from monitoring CLDR data, especially for 
French.

Here too is a huge need for action.

Thanks in advance.

Marcel



Re: The Unicode Standard and ISO

2018-05-17 Thread Michael Everson via Unicode
It would be great if mutual synchronization were considered to be of benefit. 
Some of us in SC2 are not happy that the Unicode Consortium has published 
characters which are still under Technical ballot. And this did not happen only 
once.

> On 17 May 2018, at 23:26, Peter Constable via Unicode  
> wrote:
> 
> Hence, from an ISO perspective, ISO 10646 is the only standard for which 
> on-going synchronization with Unicode is needed or relevant.




RE: The Unicode Standard and ISO

2018-05-17 Thread Peter Constable via Unicode
ISO character encoding standards are primarily focused on identifying a 
repertoire of character elements and their code point assignments in some 
encoding form. ISO developed other, legacy character-encoding standards in the 
past, but has not done so for over 20 years. All of those legacy standards can 
be mapped as a bijection to ISO 10646; in regard to character repertoires, they 
are all proper subsets of ISO 10646. 

Hence, from an ISO perspective, ISO 10646 is the only standard for which 
on-going synchronization with Unicode is needed or relevant.


Peter

-Original Message-
From: Unicode  On Behalf Of Martinho Fernandes via 
Unicode
Sent: Thursday, May 17, 2018 8:08 AM
To: unicode@unicode.org
Subject: The Unicode Standard and ISO

Hello,

There are several mentions of synchronization with related standards in 
unicode.org, e.g. in https://www.unicode.org/versions/index.html, and 
https://www.unicode.org/faq/unicode_iso.html. However, all such mentions never 
mention anything other than ISO 10646.

I was wondering which ISO standards other than ISO 10646 specify the same 
things as the Unicode Standard, and of those, which ones are actively kept in 
sync. This would be of importance for standardization of Unicode facilities in 
the C++ language (ISO 14882), as reference to ISO standards is generally 
preferred in ISO standards.

--
Martinho





Re: The Unicode Standard and ISO

2018-05-17 Thread Asmus Freytag via Unicode

On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:

Hello,

There are several mentions of synchronization with related standards in
unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
never mention anything other than ISO 10646.


Because that is the standard for which there is an explicit understanding by 
all involved relating to synchronization. There have been occasionally some 
challenging differences in the process and procedures, but generally the 
synchronization is being maintained, something that's helped by the fact that 
so many people are active in both arenas.

There are really no other standards where the same is true to the same 
extent.


I was wondering which ISO standards other than ISO 10646 specify the
same things as the Unicode Standard, and of those, which ones are
actively kept in sync. This would be of importance for standardization
of Unicode facilities in the C++ language (ISO 14882), as reference to
ISO standards is generally preferred in ISO standards.

One of the areas the Unicode Standard differs from ISO 10646 is that its 
conception of a character's identity implicitly contains that character's 
properties - and those are standardized as well and alongside of just name 
and serial number.

Many of these properties have associated with them algorithms, e.g. the bidi 
algorithm, that are an essential element of data interchange: if you don't 
know which order in the backing store is expected by the recipient to produce 
a certain display order, you cannot correctly prepare your data.

There is one area where standardization in ISO relates to work in Unicode 
that I can think of, and that is sorting. However, sorting, beyond the 
underlying framework, ultimately relates to languages, and language-specific 
data is now housed in CLDR.

Early attempts by ISO to standardize a similar framework for locale data 
failed, in part because the framework alone isn't the interesting challenge 
for a repository; instead it is the collection, vetting and management of 
the data.

The reality is that the ISO model and its organizational structures are not 
well suited to the needs of many important areas where some form of 
standardization is needed. That's why we have organizations like IETF, W3C, 
Unicode etc.

Duplicating all or even part of their effort inside ISO really serves 
nobody's purpose.


A./