Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Richard Wordingham via Unicode
On Tue, 23 May 2017 17:44:49 -0700
Ken Whistler via Unicode  wrote:
 
> Ah, but keep in mind, if projecting out to Version 23.0 (in the year 
> 2030, by our current schedule), there is a significant chance that 
> particular UCD data files may have morphed into something entirely 
> different. Recall how at one point Unihan.txt morphed into Unihan.zip 
> with multiple subpart files. Even though the maintainers of the UCD
> data files do our best to maintain them to be as stable as possible,
> their content and sometimes their formats do morph gradually from
> release to release. Just don't expect *any* parser to be completely
> forward proofed against what *might* happen in the UCD in some future
> version.

So long as the parser chokes on the new input, that is not too bad for
my programs, which rely on being directed to a local copy of the
UCD.  That issue would be nastier for any program that tries to keep
abreast of Unicode additions by downloading the relevant parts of the
UCD.

> On the other hand, for the property Age, even in the absence of 
> normative definitions of invariants for the property values, given 
> recent practice, it is pretty damn safe to assume:

> A. Major versions will continue to have two digits, incremented by
> one for each subsequent version: 10, 11, 12, ... 99.
> B. Minor versions will mostly (if not entirely) consist of the value 
> "0", and will never require two digits.

> Assumption A will get you through this century, which by my
> estimation should well exceed the lifetime of any code you might be
> writing now that depends on it.

Yes, but
http://www.thejokeshop.org/2008/12/as-useful-as-a-cobol-programmer/ .

> BTW, unlike many actual products, the version numbering of the
> Unicode Standard is not really driven by marketing concerns. So there
> is very little chance of some version sequence for Unicode that ends
> up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista,
> 7, 8, 8.1, 10 ... ;-)

The risk I saw was that someone would decide to deprecate value names
that look like floating point numbers, so that the relevant value for
Version 17.0.0 would be named V17_0 and have no aliases.

The new text in UAX#44 is also proof against the major version numbers
suddenly becoming the year numbers, as has happened with several
products.

> Yes. You could always file another piece of feedback using the
> contact form. However, in this case, you already have the attention
> of the editors of UAX #44. So my advice would be to simply wait now
> for the publication of Version 10.0 of UAX #44 around the 3rd week of
> June.

What deterred me was:
(a) "The beta review period for Unicode 10.0 and
related technical standards will close on May 1, 2017. This is the last
opportunity for technical comments before version 10.0 is released in
Q2 2017." -
http://blog.unicode.org/2017/04/last-call-on-unicode-100-beta-review.html

and

(b) Proposed changes aren't yet part of the Unicode standard.

Richard.


Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Ken Whistler via Unicode

Richard


On 5/23/2017 1:48 PM, Richard Wordingham via Unicode wrote:

> The object is to generate code *now* that, up to say Unicode Version 23.0,
> can work out, from the UCD files DerivedAge.txt and
> PropertyValueAliases.txt, whether an arbitrary code point was included
> by some Unicode version identified by a value of the property Age.


Ah, but keep in mind, if projecting out to Version 23.0 (in the year 
2030, by our current schedule), there is a significant chance that 
particular UCD data files may have morphed into something entirely 
different. Recall how at one point Unihan.txt morphed into Unihan.zip 
with multiple subpart files. Even though the maintainers of the UCD data 
files do our best to maintain them to be as stable as possible, their 
content and sometimes their formats do morph gradually from release to 
release. Just don't expect *any* parser to be completely forward proofed 
against what *might* happen in the UCD in some future version.


On the other hand, for the property Age, even in the absence of 
normative definitions of invariants for the property values, given 
recent practice, it is pretty damn safe to assume:


A. Major versions will continue to have two digits, incremented by one 
for each subsequent version: 10, 11, 12, ... 99.
B. Minor versions will mostly (if not entirely) consist of the value 
"0", and will never require two digits.


Assumption A will get you through this century, which by my estimation 
should well exceed the lifetime of any code you might be writing now 
that depends on it.
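
Assumptions A and B can be turned into a deliberately strict parser, one that rejects anything outside the expected shape instead of guessing. The following is a minimal sketch, not anything defined by the UCD: the pattern and the name parse_age_strict are illustrative, and one-digit major versions are allowed so that the historical values (1.1 through 9.0) still parse.

```python
import re

# Assumed shape of a short Age value: one or two digits for the major
# version, a FULL STOP, and exactly one digit for the minor version.
_AGE = re.compile(r"([1-9][0-9]?)\.([0-9])")

def parse_age_strict(short_value):
    """Return (major, minor) or raise ValueError for any unexpected form."""
    match = _AGE.fullmatch(short_value)
    if match is None:
        raise ValueError(f"unexpected Age value: {short_value!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_age_strict("10.0"))
```

A value such as "1,000.0" or "100.0" raises ValueError here, so a change in the value format is noticed rather than silently mis-sorted.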


BTW, unlike many actual products, the version numbering of the Unicode 
Standard is not really driven by marketing concerns. So there is very 
little chance of some version sequence for Unicode that ends up fitting 
a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, 7, 8, 8.1, 10 
... ;-)




> What TUS 9.0, its appendices and annexes are lacking is a clear
> statement such as, "The short values for the Age property are of the
> form "m.n", with the first field corresponding to the major version,
> and the second field corresponding to the minor version. There is no
> need for a third version field, because new characters are never
> assigned in update versions of the standard."


I think the UTC and the editors had just been assuming that the pattern 
was so obvious that it needed no explaining. But the lack of a clear 
description of Age had become apparent, which is why I wrote that text 
to add to UAX #44 for the upcoming version.



> Conveniently, this
> almost true statement is included in Section 5.14 of the proposed
> update to UAX#44 (in Draft 12, to be precise).  It's not quite true, for
> there is also the short value NA for Unassigned.  Is there any way of
> formally recording this oversight?


Yes. You could always file another piece of feedback using the contact 
form. However, in this case, you already have the attention of the 
editors of UAX #44. So my advice would be to simply wait now for the 
publication of Version 10.0 of UAX #44 around the 3rd week of June.


--Ken




Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Richard Wordingham via Unicode
On Tue, 23 May 2017 05:29:33 -0700
Asmus Freytag via Unicode  wrote:

> On 5/23/2017 4:04 AM, Janusz S. Bien via Unicode wrote:
> > Quote/Cytat - Manuel Strehl via Unicode  (Tue
> > 23 May 2017 11:33:24 AM CEST):
> >  
> >> The rising standard in the world of web development (and others)
> >> is called
> >> »Semantic Versioning« [1], that many projects adhere to or
> >> sometimes must
> >> actively explain, why they don't.
> >>
> >> The structure of a »semantic version« string is a set of three
> >> integers, MAJOR.MINOR.PATCH, where the »semantics« part lies in a
> >> kind of contract between author and user, when to increment which
> >> part. 
> >
> > Perhaps I am missing something, but I don't understand this thread.
> > Cf.  
> 
> You are not missing anything, the OP is being obtuse. We just didn't 
> want to run the search for him. :)

The object is to generate code *now* that, up to say Unicode Version 23.0,
can work out, from the UCD files DerivedAge.txt and
PropertyValueAliases.txt, whether an arbitrary code point was included
by some Unicode version identified by a value of the property Age.
One needs this capability to implement
the regular expressions of the form \p{Age=xxx}.  This requires a scheme
for determining which of two values of the property identifies the
earlier version of Unicode.
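
The task described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample lines imitate the layout of DerivedAge.txt but are an abridged excerpt, and the helper names (parse_version, load_ages, in_version) are made up for the example.

```python
SAMPLE = """\
# Abridged lines in the style of DerivedAge.txt (illustrative, not the full file)
0020..007E    ; 1.1 #  [95] SPACE..TILDE
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
"""

def parse_version(text):
    """Split a dotted value such as "10.0" into a tuple of integers."""
    return tuple(int(field) for field in text.split("."))

def load_ages(lines):
    """Return a list of (first, last, version_tuple) from DerivedAge-style lines."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not data:
            continue
        code_points, age = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), parse_version(age)))
    return ranges

def in_version(code_point, version, ranges):
    """True if code_point was assigned in `version` or in any earlier version."""
    wanted = parse_version(version)
    return any(first <= code_point <= last and age <= wanted
               for first, last, age in ranges)

ranges = load_ages(SAMPLE.splitlines())
print(in_version(0x20AC, "2.0", ranges))   # EURO SIGN has Age 2.1
print(in_version(0x20AC, "10.0", ranges))
```

The comparison `age <= wanted` is done on integer tuples, which is exactly the ordering scheme the paragraph above asks for.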

What TUS 9.0, its appendices and annexes are lacking is a clear
statement such as, "The short values for the Age property are of the
form "m.n", with the first field corresponding to the major version,
and the second field corresponding to the minor version. There is no
need for a third version field, because new characters are never
assigned in update versions of the standard."  Conveniently, this
almost true statement is included in Section 5.14 of the proposed
update to UAX#44 (in Draft 12, to be precise).  It's not quite true, for
there is also the short value NA for Unassigned.  Is there any way of
formally recording this oversight?

With this proposed change, to compare two values, all one has to do
is compare the short names of the values, for one knows what form they
will be in.
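
Given the "m.n" form plus the NA exception, a comparison of short values might look like the sketch below. The function name age_key is hypothetical, and placing NA after every real version is an assumption made for the example, not something the data files state.

```python
def age_key(short_value):
    """Sort key for short Age values.

    Versions compare field by field as integers; "NA" (Unassigned) is
    placed after every version, which is an assumption of this sketch.
    """
    if short_value == "NA":
        return (1, ())
    return (0, tuple(int(field) for field in short_value.split(".")))

values = ["NA", "10.0", "2.0", "1.1"]
print(sorted(values, key=age_key))
print(age_key("2.0") < age_key("10.0"))
```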

> > Version numbers for the Unicode Standard consist of three fields, 
> > denoting the major version, the minor version, and the update
> > version, respectively.

Yes, but 4.0.1 is not a value of the property Age; the last field is
redundant.  Oddly enough, ICU understands the regular expression
\p{age=4.0.1}, but not \p{age=V2_1}
(http://demo.icu-project.org/icu-bin/redemo).  Ah well, it's only a
recommendation that regular expression engines understand both short
names and long names of values of properties.
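
One way a program could accept both the short and the long names is to build a lookup from the age lines of PropertyValueAliases.txt. The sketch below assumes the usual "property; short ; long" layout of that file; the sample lines are abridged, the name load_age_aliases is made up, and plain case-folding stands in for the fuller loose-matching rules of UAX #44.

```python
SAMPLE_ALIASES = """\
# Lines in the style of the age entries in PropertyValueAliases.txt
age; 1.1                              ; V1_1
age; 2.1                              ; V2_1
age; 10.0                             ; V10_0
age; NA                               ; Unassigned
"""

def load_age_aliases(lines):
    """Map every case-folded short or long Age name to its short name."""
    to_short = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if fields[0] != "age":
            continue
        short, long_name = fields[1], fields[2]
        to_short[short.casefold()] = short
        to_short[long_name.casefold()] = short
    return to_short

aliases = load_age_aliases(SAMPLE_ALIASES.splitlines())
print(aliases["V2_1".casefold()])
print(aliases["2.1"])
```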

Richard.



Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Philippe Verdy via Unicode
2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode :

> On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote:
>
>> One of the objectives is to use a current version of the UCD to
>> determine, for example, which characters were in Version x.y.  One
>> needs that for a regular expression such as [:Age=3.0:], which
>> also matches all characters that have survived since Version 1.1.
>> Another is to record for which versions of the standard a character had
>> some particular value of a property.
>>
>
> Richard,
>
> I would tend to side with those who claim that "version number" is
> something that's defined by common industry practice, and therefore not
> something that Unicode needs to define - but is allowed to use. Just like
> Unicode doesn't define what an integer is, or hexadecimal number system or
> a whole host of other concepts that are used in defining in turn what
> Unicode is.
>
> As Markus implied, version numbers are a positional number system where
> the positions in turn are integers in decimal notation, separated by dots.
>

Not all version numbers follow this scheme of dots and integers only. There
are also version numbers using dates (separated by hyphens, as in the ISO
format), additional letters (a, b, c, ...), or labels (alpha, beta, RC),
sometimes in the middle of other fields. These labels are not always easy
to compare, but they are generally treated as case-insensitive and tend to
avoid non-Latin letters (so Greek letters are spelled out in Latin), and they
cannot always be parsed and combined into a single integer.
For comparing and sorting, it is best to compare case-insensitively, using
only primary differences in the UCA. But the UCA should be tweaked with a
preparsing step that locates the numbers.
In rare cases you may find Roman numerals (I, II, III, IV, V, IX, X),
which cannot be sorted strictly like other Latin letters.
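
The preparsing idea can be approximated without a collator. The sketch below is a simplification and not the UCA: it lower-cases the string, splits it into runs of ASCII digits and ASCII letters, and compares numbers numerically and labels alphabetically. The name natural_key is made up, and the key knows nothing about label meaning (it would not place "rc" before a final release) or about Roman numerals.

```python
import re

def natural_key(version):
    """Split into digit runs and letter runs; numbers sort before labels."""
    parts = re.findall(r"[0-9]+|[a-z]+", version.lower())
    return [(0, int(part)) if part.isdigit() else (1, part) for part in parts]

versions = ["1.10", "1.9", "1.9b", "1.9A"]
print(sorted(versions, key=natural_key))
print(natural_key("2017-05-23"))
```

Tagging each part with 0 or 1 keeps integers and strings from ever being compared with each other, which would raise a TypeError in Python 3.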


Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Asmus Freytag via Unicode

On 5/23/2017 4:04 AM, Janusz S. Bien via Unicode wrote:
> Quote/Cytat - Manuel Strehl via Unicode  (Tue 23 May 2017 11:33:24 AM CEST):
>
>> The rising standard in the world of web development (and others) is called
>> »Semantic Versioning« [1], that many projects adhere to or sometimes must
>> actively explain, why they don't.
>>
>> The structure of a »semantic version« string is a set of three integers,
>> MAJOR.MINOR.PATCH, where the »semantics« part lies in a kind of contract
>> between author and user, when to increment which part.
>
> Perhaps I am missing something, but I don't understand this thread. Cf.


You are not missing anything, the OP is being obtuse. We just didn't 
want to run the search for him. :)

A./


> http://unicode.org/versions/
>
> Version numbers for the Unicode Standard consist of three fields,
> denoting the major version, the minor version, and the update version,
> respectively.
>
> The differences between major, minor, and update versions are as follows:
>
> [...]
>
> Best regards
>
> Janusz





Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Janusz S. Bien via Unicode
Quote/Cytat - Manuel Strehl via Unicode  (Tue 23  
May 2017 11:33:24 AM CEST):



> The rising standard in the world of web development (and others) is called
> »Semantic Versioning« [1], that many projects adhere to or sometimes must
> actively explain, why they don't.
>
> The structure of a »semantic version« string is a set of three integers,
> MAJOR.MINOR.PATCH, where the »semantics« part lies in a kind of contract
> between author and user, when to increment which part.



Perhaps I am missing something, but I don't understand this thread. Cf.

http://unicode.org/versions/

Version numbers for the Unicode Standard consist of three fields,  
denoting the major version, the minor version, and the update version,  
respectively.


The differences between major, minor, and update versions are as follows:

[...]

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Manuel Strehl via Unicode
The rising standard in the world of web development (and others) is called
»Semantic Versioning« [1], that many projects adhere to or sometimes must
actively explain, why they don't.

The structure of a »semantic version« string is a set of three integers,
MAJOR.MINOR.PATCH, where the »semantics« part lies in a kind of contract
between author and user, when to increment which part.

I am _not_ suggesting that Unicode embrace that standard, merely stating
that this is what many frontend developers will simply assume when looking
at a version string that matches this pattern.

--Manuel

[1] http://semver.org/

2017-05-23 8:43 GMT+02:00 Asmus Freytag via Unicode :

> On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote:
>
>> One of the objectives is to use a current version of the UCD to
>> determine, for example, which characters were in Version x.y.  One
>> needs that for a regular expression such as [:Age=3.0:], which
>> also matches all characters that have survived since Version 1.1.
>> Another is to record for which versions of the standard a character had
>> some particular value of a property.
>>
>
> Richard,
>
> I would tend to side with those who claim that "version number" is
> something that's defined by common industry practice, and therefore not
> something that Unicode needs to define - but is allowed to use. Just like
> Unicode doesn't define what an integer is, or hexadecimal number system or
> a whole host of other concepts that are used in defining in turn what
> Unicode is.
>
> As Markus implied, version numbers are a positional number system where
> the positions in turn are integers in decimal notation, separated by dots.
>
> As it is neither a "string" nor a single number, neither of those common
> sorting methods give the right answer, but a multi-field sort will.
>
> If you have a multi-field sort algorithm that uses commas as the
> delimiter, just swap out the dots for commas. If not, then you have to
> implement your own multi-level sort.
>
> In any well-designed modern runtime library you can pass a comparison
> method to any of the sorting algorithms (or sorted data collections).
>
> A./
>
> PS: somewhere in the standard, Unicode does define names for the fields:
> Major, Minor and Update. The use of the term "Update" may not be universal,
> but major and minor version numbers are a well established concept and do
> not need a definition. The naming also implies the order of precedence.
>


Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Asmus Freytag via Unicode

On 5/22/2017 3:49 PM, Richard Wordingham via Unicode wrote:

> One of the objectives is to use a current version of the UCD to
> determine, for example, which characters were in Version x.y.  One
> needs that for a regular expression such as [:Age=3.0:], which
> also matches all characters that have survived since Version 1.1.
> Another is to record for which versions of the standard a character had
> some particular value of a property.


Richard,

I would tend to side with those who claim that "version number" is 
something that's defined by common industry practice, and therefore not 
something that Unicode needs to define - but is allowed to use. Just 
like Unicode doesn't define what an integer is, or hexadecimal number 
system or a whole host of other concepts that are used in defining in 
turn what Unicode is.


As Markus implied, version numbers are a positional number system where 
the positions in turn are integers in decimal notation, separated by dots.


As it is neither a "string" nor a single number, neither of those common 
sorting methods give the right answer, but a multi-field sort will.


If you have a multi-field sort algorithm that uses commas as the 
delimiter, just swap out the dots for commas. If not, then you have to 
implement your own multi-level sort.


In any well-designed modern runtime library you can pass a comparison 
method to any of the sorting algorithms (or sorted data collections).
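
In Python, for instance, passing a comparison method to a sort could look like the following sketch. The function compare_versions is hypothetical; padding the shorter value with zeros, so that "4.0" and "4.0.0" compare equal, is an assumption of the example rather than a rule taken from the standard.

```python
from functools import cmp_to_key

def compare_versions(left, right):
    """Three-way, field-by-field comparison: negative, zero, or positive."""
    a = [int(field) for field in left.split(".")]
    b = [int(field) for field in right.split(".")]
    # Assumption: a missing trailing field counts as 0.
    width = max(len(a), len(b))
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    return (a > b) - (a < b)

versions = ["4.0.1", "10.0.0", "4.0", "2.1.9"]
print(sorted(versions, key=cmp_to_key(compare_versions)))
```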


A./

PS: somewhere in the standard, Unicode does define names for the fields: 
Major, Minor and Update. The use of the term "Update" may not be 
universal, but major and minor version numbers are a well established 
concept and do not need a definition. The naming also implies the order 
of precedence.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Richard Wordingham via Unicode
On Mon, 22 May 2017 17:19:08 -0500
Anshuman Pandey  wrote:

> I performed several operations on DerivedAge.txt a few months ago.
> One basic example here:
> 
> https://pandey.github.io/posts/unicode-growth-UCD-python.html

So what happens if you apply it to Unicode Version 10.0?  Are the
versions sorted as strings, as real numbers, or just in the order of
the data in DerivedAge.txt?

> If you provide some more insight into your objective, I might be able
> to help.

One of the objectives is to use a current version of the UCD to
determine, for example, which characters were in Version x.y.  One
needs that for a regular expression such as [:Age=3.0:], which
also matches all characters that have survived since Version 1.1.
Another is to record for which versions of the standard a character had
some particular value of a property.

Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Richard Wordingham via Unicode
On Mon, 22 May 2017 15:10:02 -0700
Markus Scherer via Unicode  wrote:

> On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > Given two raw values of the Age property, defined in UCD file
> > DerivedAge.txt, how is a computer program supposed to compare them?
> > Apart from special handling for the value "Unassigned" and its short
> > alias "NA", one used to be able to compare short values against
> > short values and long values against long values by simple string
> > comparison.  However, now we are coming to Version 10.0 of Unicode,
> > this no longer works - "1.1" < "10.0" < "2.0".
> >  
> 
> This is normal for numbers, and for multi-field version numbers.
> If you want numeric sorting, then you need to either use a collator
> with that option, or parse the versions into tuples of integers and
> sort those.

Well, comparing "15.2" and "15.12" gives different answers depending on
whether you view them as decimal numbers or a hierarchical sequence of
numbers.
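
The two readings can be checked directly. The sketch below uses the pair "15.2" and "15.12", for which they disagree; the helper name as_fields is made up for the example.

```python
from decimal import Decimal

def as_fields(text):
    """Read a dotted value as a hierarchical sequence of integers."""
    return tuple(int(field) for field in text.split("."))

a, b = "15.2", "15.12"

# As decimal numbers, 15.2 is 15.20 and therefore the larger value.
decimal_says_a_first = Decimal(a) < Decimal(b)
# As fields, (15, 2) precedes (15, 12) because 2 < 12.
fields_say_a_first = as_fields(a) < as_fields(b)

print(decimal_says_a_first, fields_say_a_first)
```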

> > Can one rely on the FULL STOP being the field
> > divider,
 
> I think so. Dots are extremely common for version numbers. I see no
> reason for Unicode to use something else.

But where is that stated?

> > and can one rely on there never being any grouping characters
> > in the short values?
 
> I don't know what "grouping characters" you have in mind.

Comma is the obvious one.

Looking to the far future (I trust you've heard of the predicted Cobol
crisis for the Y10k problem), will we have "1000.0" or "1,000.0"?

Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Anshuman Pandey via Unicode
I performed several operations on DerivedAge.txt a few months ago. One basic 
example here:

https://pandey.github.io/posts/unicode-growth-UCD-python.html

If you provide some more insight into your objective, I might be able to help.

I would recommend against relying on the order of the data, and that you 
instead parse the individual entries to obtain the 'Age' property.

All my best,
Anshu


> On May 22, 2017, at 4:44 PM, Richard Wordingham via Unicode 
>  wrote:
> 
> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison.  However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
> 
> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.  However, I can find no
> relevant guarantees in UAX#44.  I am looking for a solution that can be
> driven by the data files, rather than requiring human thought at every
> version release.  Can one rely on the FULL STOP being the field
> divider, and can one rely on there never being any grouping characters
> in the short values?  Again, I could find no guarantees.
> 
> Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Markus Scherer via Unicode
On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison.  However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
>

This is normal for numbers, and for multi-field version numbers.
If you want numeric sorting, then you need to either use a collator with
that option, or parse the versions into tuples of integers and sort those.
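
The tuple approach can be sketched in a few lines of Python; the name parse_version is only illustrative. The first sort shows the string ordering complained about above, the second the ordering by integer fields.

```python
def parse_version(text):
    """Split a dotted version string such as "10.0" into a tuple of ints."""
    return tuple(int(field) for field in text.split("."))

ages = ["1.1", "10.0", "2.0", "6.3", "9.0"]

# Plain string sort compares character by character, so "10.0" lands
# between "1.1" and "2.0".
print(sorted(ages))
# Sorting on the integer tuples gives the release order.
print(sorted(ages, key=parse_version))
```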

> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.


You should not rely on the order of values in data files, unless the file
explicitly states that order matters.

> Can one rely on the FULL STOP being the field
> divider,


I think so. Dots are extremely common for version numbers. I see no reason
for Unicode to use something else.

> and can one rely on there never being any grouping characters
> in the short values?


I don't know what "grouping characters" you have in mind.

I think the format is pretty self-evident.

markus