Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Richard Wordingham via Unicode
On Mon, 22 May 2017 17:19:08 -0500
Anshuman Pandey wrote:

> I performed several operations on DerivedAge.txt a few months ago.
> One basic example here:
> 
> https://pandey.github.io/posts/unicode-growth-UCD-python.html

So what happens if you apply it to Unicode Version 10.0?  Are the
versions sorted as strings, as real numbers, or just in the order of
the data in DerivedAge.txt?

> If you provide some more insight into your objective, I might be able
> to help.

One of the objectives is to use a current version of the UCD to
determine, for example, which characters were in Version x.y.  One
needs that for a regular expression such as [:Age=3.0:], which
also matches all characters that have survived since Version 1.1.
Another is to record for which versions of the standard a character had
some particular value of a property.
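
As a minimal sketch of the first objective: the snippet below treats a code point as belonging to Version x.y when its Age is x.y or earlier, which is the sense in which [:Age=3.0:] is described above. The sample lines only imitate the layout of DerivedAge.txt and are abbreviated for illustration; `read_ages` and `in_version` are hypothetical helper names, and the FULL STOP field separator is an assumption that the thread notes is not obviously guaranteed.

```python
# Abbreviated, illustrative lines in the layout of DerivedAge.txt.
SAMPLE = """\
# code point or range ; age # comment
0041..005A    ; 1.1 # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
20AC          ; 2.1 # EURO SIGN
1F600         ; 6.1 # GRINNING FACE
"""

def parse_age(value):
    # Assumption: fields are separated by FULL STOP and are plain integers.
    return tuple(int(field) for field in value.split("."))

def read_ages(text):
    """Yield (first, last, age_tuple) for each data line."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        code_points, age = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        yield int(first, 16), int(last or first, 16), parse_age(age)

def in_version(cp, version, ranges):
    """True if cp has an Age of `version` or earlier."""
    limit = parse_age(version)
    return any(lo <= cp <= hi and age <= limit for lo, hi, age in ranges)

ranges = list(read_ages(SAMPLE))
print(in_version(0x20AC, "3.0", ranges))   # True: Age 2.1 <= 3.0
print(in_version(0x1F600, "3.0", ranges))  # False: Age 6.1 > 3.0
```

Code points that are not listed at all (Age "Unassigned") simply match no version in this sketch.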

Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Richard Wordingham via Unicode
On Mon, 22 May 2017 15:10:02 -0700
Markus Scherer via Unicode wrote:

> On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > Given two raw values of the Age property, defined in UCD file
> > DerivedAge.txt, how is a computer program supposed to compare them?
> > Apart from special handling for the value "Unassigned" and its short
> > alias "NA", one used to be able to compare short values against
> > short values and long values against long values by simple string
> > comparison.  However, now we are coming to Version 10.0 of Unicode,
> > this no longer works - "1.1" < "10.0" < "2.0".
> >  
> 
> This is normal for numbers, and for multi-field version numbers.
> If you want numeric sorting, then you need to either use a collator
> with that option, or parse the versions into tuples of integers and
> sort those.

Well, comparing "15.2" and "15.12" gives different answers depending on
whether you view them as decimal numbers or a hierarchical sequence of
numbers.
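
A small sketch of the two readings, using the pair "15.2" and "15.12" as an illustrative choice because the two interpretations order it differently. `as_fields` is a hypothetical helper that assumes FULL STOP separates integer fields.

```python
from decimal import Decimal

def as_fields(value):
    # Hierarchical reading: "15.12" -> (15, 12)
    return tuple(int(field) for field in value.split("."))

a, b = "15.2", "15.12"

# Decimal reading: 15.2 is greater than 15.12.
print(Decimal(a) < Decimal(b))      # False

# Hierarchical reading: minor version 2 precedes minor version 12.
print(as_fields(a) < as_fields(b))  # True
```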

> > Can one rely on the FULL STOP being the field
> > divider,
>
> I think so. Dots are extremely common for version numbers. I see no
> reason for Unicode to use something else.

But where is that stated?

> > and can one rely on there never being any grouping characters
> > in the short values?
>
> I don't know what "grouping characters" you have in mind.

Comma is the obvious one.

Looking to the far future (I trust you've heard of the predicted Cobol
crisis for the Y10k problem), will we have "1000.0" or "1,000.0"?
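
Since no guarantee about the format has been identified in this thread, one defensive option is to validate each short value before parsing it, so that an unexpected form such as "1,000.0" fails loudly instead of being mis-sorted. This is only a sketch: the pattern below encodes the assumed format (ASCII digits separated by FULL STOP) and `parse_age_strict` is a hypothetical name.

```python
import re

# Assumed format for short Age values: two or more runs of ASCII digits
# separated by FULL STOP, with no grouping characters.
_SHORT_AGE = re.compile(r"[0-9]+(?:\.[0-9]+)+")

def parse_age_strict(value):
    """Return a tuple of ints, or raise ValueError for any other format."""
    if not _SHORT_AGE.fullmatch(value):
        raise ValueError(f"unexpected Age value format: {value!r}")
    return tuple(int(field) for field in value.split("."))

print(parse_age_strict("1000.0"))   # (1000, 0)

try:
    parse_age_strict("1,000.0")
except ValueError as err:
    print(err)                      # reports the unexpected format
```

The special value "NA" is rejected here as well and would need separate handling.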

Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Anshuman Pandey via Unicode
I performed several operations on DerivedAge.txt a few months ago. One basic 
example here:

https://pandey.github.io/posts/unicode-growth-UCD-python.html

If you provide some more insight into your objective, I might be able to help.

I would recommend against relying on the order of the data, and that you 
instead parse the individual entries to obtain the 'Age' property.
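
A minimal sketch of that approach, assuming the usual "code point or range ; value # comment" layout of UCD data files. The sample lines are abbreviated, deliberately out of order, and only illustrative; `count_by_age` is a hypothetical helper. The Age values are collected from the entries themselves and then ordered numerically, so the order of lines in the file does not matter.

```python
from collections import Counter

# Abbreviated, illustrative lines, deliberately not in Age order.
SAMPLE = """\
1F600         ; 6.1 # GRINNING FACE
0041..005A    ; 1.1 # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
20AC          ; 2.1 # EURO SIGN
0061..007A    ; 1.1 # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
"""

def count_by_age(text):
    """Map each Age value to the number of code points carrying it."""
    counts = Counter()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        code_points, age = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        counts[age] += int(last or first, 16) - int(first, 16) + 1
    return counts

counts = count_by_age(SAMPLE)
# Order the Age values numerically by field, not by position in the file.
ordered = sorted(counts, key=lambda v: tuple(int(f) for f in v.split(".")))
for age in ordered:
    print(age, counts[age])
```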

All my best,
Anshu


> On May 22, 2017, at 4:44 PM, Richard Wordingham via Unicode wrote:
> 
> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison.  However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
> 
> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.  However, I can find no
> relevant guarantees in UAX#44.  I am looking for a solution that can be
> driven by the data files, rather than requiring human thought at every
> version release.  Can one rely on the FULL STOP being the field
> divider, and can one rely on there never being any grouping characters
> in the short values?  Again, I could find no guarantees.
> 
> Richard.


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Markus Scherer via Unicode
On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison.  However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
>

This is normal for numbers, and for multi-field version numbers.
If you want numeric sorting, then you need to either use a collator with
that option, or parse the versions into tuples of integers and sort those.
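
The tuple approach can be sketched as follows. This assumes that short Age values consist of integer fields separated by FULL STOP; `parse_age` is a hypothetical helper name, and the list of values is just a sample.

```python
def parse_age(value):
    """Parse a short Age value such as "10.0" into a tuple of ints."""
    # Assumption: FULL STOP is the field separator, no grouping characters.
    return tuple(int(field) for field in value.split("."))

ages = ["1.1", "10.0", "2.0", "6.3", "3.2"]

# Plain string comparison puts "10.0" between "1.1" and "2.0".
print(sorted(ages))                 # ['1.1', '10.0', '2.0', '3.2', '6.3']

# Comparing tuples of integers gives the intended version order.
print(sorted(ages, key=parse_age))  # ['1.1', '2.0', '3.2', '6.3', '10.0']
```

The special value "Unassigned" (short alias "NA") is not numeric and would need to be handled before calling such a helper.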

> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.


You should not rely on the order of values in data files, unless the file
explicitly states that order matters.

> Can one rely on the FULL STOP being the field
> divider,


I think so. Dots are extremely common for version numbers. I see no reason
for Unicode to use something else.

> and can one rely on there never being any grouping characters
> in the short values?


I don't know what "grouping characters" you have in mind.

I think the format is pretty self-evident.

markus


Comparing Raw Values of the Age Property

2017-05-22 Thread Richard Wordingham via Unicode
Given two raw values of the Age property, defined in UCD file
DerivedAge.txt, how is a computer program supposed to compare them?
Apart from special handling for the value "Unassigned" and its short
alias "NA", one used to be able to compare short values against short
values and long values against long values by simple string
comparison.  However, now we are coming to Version 10.0 of Unicode,
this no longer works - "1.1" < "10.0" < "2.0".

There are some possibilities - the values appear in order in
PropertyValueAliases.txt and in DerivedAge.txt.  However, I can find no
relevant guarantees in UAX#44.  I am looking for a solution that can be
driven by the data files, rather than requiring human thought at every
version release.  Can one rely on the FULL STOP being the field
divider, and can one rely on there never being any grouping characters
in the short values?  Again, I could find no guarantees.

Richard.


Conference marking 40th anniversary of Niamey expert meeting?

2017-05-22 Thread Don Osborn via Unicode
Is there any interest in a conference on support for African languages,
including issues at the character and script level? I'm looking at the
upcoming 40th anniversary of the Niamey expert meeting on "Transcription and
Harmonization of African Languages" with the thought that it might be an
opportune occasion to take stock of a process that was prominent in the
1960s - 1970s, reflecting/shaping the Latin-based orthographies used today,
and consider current issues with all scripts used in Africa. Such an event
could also serve as a way to exchange skills and network among people doing
applied work (localization, content development, language technology).

 

I've just posted a short question to that effect at
http://niamey.blogspot.com/2017/05/marking-40th-anniversary-of-niamey.html
in the hopes of eliciting feedback. This post also references 2 earlier
postings about the 50th anniversary of the landmark 1966 Bamako expert
meeting, in which various possible issues for discussion were mentioned.

 

The 1978 Niamey conference was a key meeting among a series of
UNESCO-(co)sponsored expert meetings on harmonization of transcriptions
(orthographies) in Latin script during the 1960s and 1970s. Among other
things, this conference produced the African Reference Alphabet, which has
been referred to in standardization of orthographies in several countries
and in much later discussions relating to Unicode.

 

Thanks in advance for any feedback, here or on the blog.

 

Don Osborn, PhD