Re: Comparing Raw Values of the Age Property
On Mon, 22 May 2017 17:19:08 -0500 Anshuman Pandey wrote:

> I performed several operations on DerivedAge.txt a few months ago.
> One basic example here:
>
> https://pandey.github.io/posts/unicode-growth-UCD-python.html

So what happens if you apply it to Unicode Version 10.0? Are the versions sorted as strings, as real numbers, or just in the order of the data in DerivedAge.txt?

> If you provide some more insight into your objective, I might be able
> to help.

One of the objectives is to use a current version of the UCD to determine, for example, which characters were in Version x.y. One needs that for a regular expression such as [:Age=3.0:], which also matches all characters that have survived since Version 1.1. Another is to record for which versions of the standard a character had some particular value of a property.

Richard.
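A minimal sketch of the first objective, assuming the usual DerivedAge.txt line layout (`code point or range ; age # comment`) and assuming that [:Age=x.y:] means "Age less than or equal to x.y". The sample lines are hand-written for illustration, not copied from the file, and the names `parse_age`, `read_ages` and `in_version` are hypothetical helpers.

```python
SAMPLE = """\
# Hand-written lines in the DerivedAge.txt layout, for illustration only.
0041..005A    ; 1.1 #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
"""

def parse_age(value):
    """Split a short Age value on FULL STOP into a tuple of ints (assumed format)."""
    return tuple(int(field) for field in value.split("."))

def read_ages(text):
    """Yield (first, last, age_tuple) for each data line, skipping comments."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_range, age = (part.strip() for part in data.split(";"))
        first, _, last = code_range.partition("..")
        # A single code point has no "..", so the range collapses to one value.
        yield int(first, 16), int(last or first, 16), parse_age(age)

def in_version(cp, version, ranges):
    """True if code point cp has Age <= version under the assumptions above."""
    limit = parse_age(version)
    return any(first <= cp <= last and age <= limit
               for first, last, age in ranges)

ranges = list(read_ages(SAMPLE))
print(in_version(0x0041, "3.0", ranges))    # True: Age 1.1 is not later than 3.0
print(in_version(0x20AC, "2.0", ranges))    # False: Age 2.1 is later than 2.0
print(in_version(0x1F600, "10.0", ranges))  # True: (6, 1) <= (10, 0)
```

The last call is the case that a plain string comparison gets wrong, since "6.1" sorts after "10.0" as a string.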
Re: Comparing Raw Values of the Age Property
On Mon, 22 May 2017 15:10:02 -0700 Markus Scherer via Unicode wrote:

> On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
>
> > Given two raw values of the Age property, defined in UCD file
> > DerivedAge.txt, how is a computer program supposed to compare them?
> > Apart from special handling for the value "Unassigned" and its short
> > alias "NA", one used to be able to compare short values against
> > short values and long values against long values by simple string
> > comparison. However, now we are coming to Version 10.0 of Unicode,
> > this no longer works - "1.1" < "10.0" < "2.0".
>
> This is normal for numbers, and for multi-field version numbers.
> If you want numeric sorting, then you need to either use a collator
> with that option, or parse the versions into tuples of integers and
> sort those.

Well, comparing "15.1" and "15.12" gives different answers depending on whether you view them as decimal numbers or a hierarchical sequence of numbers.

> > Can one rely on the FULL STOP being the field
> > divider,
>
> I think so. Dots are extremely common for version numbers. I see no
> reason for Unicode to use something else.

But where is that stated?

> > and can one rely on there never being any grouping characters
> > in the short values?
>
> I don't know what "grouping characters" you have in mind.

Comma is the obvious one. Looking to the far future (I trust you've heard of the predicted Cobol crisis for the Y10k problem), will we have "1000.0" or "1,000.0"?

Richard.
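A small sketch of the decimal-versus-hierarchical distinction, with `as_fields` as a hypothetical helper. The pair "15.2" and "15.12" is added here as an illustrative assumption, because the two readings order it oppositely; for "15.1" and "15.12" both readings happen to agree. The final lines show that a grouping comma would at least fail loudly under this parse rather than sort silently in the wrong place.

```python
from decimal import Decimal

def as_fields(value):
    """Read a version as a hierarchical sequence of integers (assumed '.' separator)."""
    return tuple(int(field) for field in value.split("."))

for a, b in [("15.1", "15.12"), ("15.2", "15.12")]:
    decimal_less = Decimal(a) < Decimal(b)     # view as decimal numbers
    fields_less = as_fields(a) < as_fields(b)  # view as (major, minor)
    print(a, b, decimal_less, fields_less)
# 15.1 15.12 True True
# 15.2 15.12 False True

# A grouping character is not accepted by int(), so it raises ValueError.
try:
    as_fields("1,000.0")
except ValueError as exc:
    print("rejected:", exc)
```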
Re: Comparing Raw Values of the Age Property
I performed several operations on DerivedAge.txt a few months ago. One basic example here:

https://pandey.github.io/posts/unicode-growth-UCD-python.html

If you provide some more insight into your objective, I might be able to help. I would recommend against relying on the order of the data, and that you instead parse the individual entries to obtain the 'Age' property.

All my best,
Anshu

> On May 22, 2017, at 4:44 PM, Richard Wordingham via Unicode wrote:
>
> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison. However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
>
> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no
> relevant guarantees in UAX#44. I am looking for a solution that can be
> driven by the data files, rather than requiring human thought at every
> version release. Can one rely on the FULL STOP being the field
> divider, and can one rely on there never being any grouping characters
> in the short values? Again, I could find no guarantees.
>
> Richard.
Re: Comparing Raw Values of the Age Property
On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <unicode@unicode.org> wrote:

> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison. However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".

This is normal for numbers, and for multi-field version numbers. If you want numeric sorting, then you need to either use a collator with that option, or parse the versions into tuples of integers and sort those.

> There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.

You should not rely on the order of values in data files, unless the file explicitly states that order matters.

> Can one rely on the FULL STOP being the field
> divider,

I think so. Dots are extremely common for version numbers. I see no reason for Unicode to use something else.

> and can one rely on there never being any grouping characters
> in the short values?

I don't know what "grouping characters" you have in mind. I think the format is pretty self-evident.

markus
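The "tuples of integers" suggestion can be sketched as follows. This assumes that short Age values consist only of decimal digits separated by FULL STOP, which is the very point under discussion and not something the thread shows to be guaranteed; `parse_age` is a hypothetical helper name, and "NA" would need separate handling because `int()` rejects it.

```python
def parse_age(value):
    """Turn a short Age value such as "10.0" into a tuple of ints.

    Assumption: fields are separated by FULL STOP and contain no
    grouping characters. "NA" / "Unassigned" must be handled by the caller.
    """
    return tuple(int(field) for field in value.split("."))

ages = ["2.0", "10.0", "1.1", "3.2", "6.1"]

print(sorted(ages))                 # string order: ['1.1', '10.0', '2.0', '3.2', '6.1']
print(sorted(ages, key=parse_age))  # numeric order: ['1.1', '2.0', '3.2', '6.1', '10.0']
```

Python compares tuples element by element, so (10, 0) sorts after (2, 0) without any padding or collation options.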
Comparing Raw Values of the Age Property
Given two raw values of the Age property, defined in UCD file DerivedAge.txt, how is a computer program supposed to compare them? Apart from special handling for the value "Unassigned" and its short alias "NA", one used to be able to compare short values against short values and long values against long values by simple string comparison. However, now we are coming to Version 10.0 of Unicode, this no longer works - "1.1" < "10.0" < "2.0".

There are some possibilities - the values appear in order in PropertyValueAliases.txt and in DerivedAge.txt. However, I can find no relevant guarantees in UAX#44. I am looking for a solution that can be driven by the data files, rather than requiring human thought at every version release. Can one rely on the FULL STOP being the field divider, and can one rely on there never being any grouping characters in the short values? Again, I could find no guarantees.

Richard.
Conference marking 40th anniversary of Niamey expert meeting?
Is there any interest in a conference on support for African languages, including issues at the character and script level? I'm looking at the upcoming 40th anniversary of the Niamey expert meeting on "Transcription and Harmonization of African Languages" with the thought that it might be an opportune occasion to take stock of a process that was prominent in the 1960s - 1970s, reflecting/shaping the Latin-based orthographies used today, and consider current issues with all scripts used in Africa. Such an event could also serve as a way to exchange skills and network among people doing applied work (localization, content development, language technology).

I've just posted a short question to that effect at http://niamey.blogspot.com/2017/05/marking-40th-anniversary-of-niamey.html in the hopes of eliciting feedback. This post also references 2 earlier postings about the 50th anniversary of the landmark 1966 Bamako expert meeting, in which various possible issues for discussion were mentioned.

The 1978 Niamey conference was a key meeting among a series of UNESCO-(co)sponsored expert meetings on harmonization of transcriptions (orthographies) in Latin script during the 1960s and 1970s. Among other things, this conference produced the African Reference Alphabet, which has been referred to in standardization of orthographies in several countries and in much later discussions relating to Unicode.

Thanks in advance for any feedback, here or on the blog.

Don Osborn, PhD