Character set cluelessness
The United Nations Economic Commission for Europe (UNECE) has released a new version of UN/LOCODE, and their Secretariat Note document is just as clueless as ever about character set usage in international standards:

"Place names in UN/LOCODE are given in their national language versions as expressed in the Roman alphabet using the 26 characters of the character set adopted for international trade data interchange, with diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be 3.3.2] of the UN/LOCODE Manual). International ISO Standard character sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The standard United States character set (437), which conforms to these ISO standards, is also widely used in trade data interchange)."

It's 2012. How does one get through to folks like this? I tried writing to them a few years ago, but I don't think they were impressed by an individual contribution.

http://www.unece.org/cefact/locode/welcome.html

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Re: Character set cluelessness
Sad to say, this seems to be close to the norm for all too many large organizations, where if it isn't in the 1990s version of the Times Roman font, then it's out.

John
RE: Character set cluelessness
I don't agree with the criticism. These place names are there to be readable by a wide audience, rather than writable by locals and specialists. They require the lowest common denominator.

Jony
RE: Character set cluelessness
Jonathan Rosenne jonathan dot rosenne at gmail dot com wrote:

"I don't agree with the criticism. These place names are there to be readable by a wide audience, rather than writable by locals and specialists. They require the lowest common denominator."

I don't mind so much if they have to maintain an ASCII-only name field, or a Latin-1-only field. But referencing the 1993 version of ISO 10646-1, or claiming that MS-DOS code page 437 is the standard United States character set in 2012 and that it conforms to 8859-1 and 10646, helps nobody.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
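The conformance claim is easy to check. The following is a minimal sketch in Python, assuming its built-in `cp437` and `latin-1` codecs faithfully represent code page 437 and ISO 8859-1; the function name `decode_both` is made up for illustration.

```python
def decode_both(byte_value: int) -> tuple[str, str]:
    """Decode a single byte as CP437 and as ISO 8859-1 (Latin-1)."""
    raw = bytes([byte_value])
    return raw.decode("cp437"), raw.decode("latin-1")


# The two encodings agree on the ASCII range (0x00-0x7F) ...
ascii_agrees = all(a == b for a, b in map(decode_both, range(0x80)))

# ... but the same byte in the upper half can mean different characters.
as_cp437, as_latin1 = decode_both(0xE9)

print("ASCII range identical:", ascii_agrees)
print("byte 0xE9 as CP437:", as_cp437, "/ as Latin-1:", as_latin1)
```

Any byte above 0x7F that is written under one of these encodings and read under the other risks coming out as a different character, which is why describing 437 as conforming to 8859-1 is misleading.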
Re: Character set cluelessness
On 2 Oct 2012, at 21:40, Doug Ewell wrote:

"But referencing the 1993 version of ISO 10646-1, or claiming that MS-DOS code page 437 is the standard United States character set in 2012 and that it conforms to 8859-1 and 10646, helps nobody."

It's not that it helps nobody. It's just that it's WRONG.

Michael Everson * http://www.evertype.com/
Re: Character set cluelessness
I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well.

Mark
https://plus.google.com/114199149796022210033
Il meglio è l’inimico del bene

On Tue, Oct 2, 2012 at 1:23 PM, Jonathan Rosenne jonathan.rose...@gmail.com wrote:

"I don't agree with the criticism. These place names are there to be readable by a wide audience, rather than writable by locals and specialists. They require the lowest common denominator."
RE: Character set cluelessness
Mark Davis mark at macchiato dot com wrote:

"I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well."

They have two name fields, one (Name) for the name transliterated into Latin, and a second (NameWoDiacritics) which is an ASCII-smashed version of the first. Again, that's fine as long as I am free to ignore the ASCII version. They don't attempt to represent names in non-Latin scripts, which is not my beef here.

There are many names in the Name (i.e. beyond ASCII) field that include characters beyond 8859-1, such as œ and ž, and certainly many beyond CP437. This is a good thing (although there are some errors, though not as many as in past years), but they need to fix their documentation to reflect what they actually do, and not make these irrelevant, misleading, and/or inaccurate references to 437 and to a 19-year-old version of 10646.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
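The two checks described above (does a name fit a given character set, and what does an ASCII-smashed version look like) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `fits` and `ascii_smash` and the `FALLBACK` table are hypothetical, the sample names are ordinary place names rather than entries copied from the UN/LOCODE data, and nothing here is known to match how UNECE actually derives NameWoDiacritics.

```python
import unicodedata

# Hypothetical fallbacks for letters that have no canonical decomposition.
FALLBACK = {"œ": "oe", "Œ": "OE", "ß": "ss", "ø": "o", "Ø": "O"}


def fits(name: str, encoding: str) -> bool:
    """Return True if every character of name is encodable in encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def ascii_smash(name: str) -> str:
    """Strip diacritics and map leftovers to ASCII ('?' if nothing applies)."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop combining marks (non-zero canonical combining class).
    no_marks = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    mapped = "".join(FALLBACK.get(ch, ch) for ch in no_marks)
    return mapped.encode("ascii", "replace").decode("ascii")


for sample in ("Zürich", "Žilina", "Cœur"):
    print(
        sample,
        {enc: fits(sample, enc) for enc in ("ascii", "cp437", "latin-1")},
        ascii_smash(sample),
    )
```

A name containing ü fits both CP437 and Latin-1, while one containing ž or œ fits neither, so a Name field holding such characters cannot be described by either reference.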
Re: Character set cluelessness
On Tue, 2 Oct 2012 13:49:57 -0700 Mark Davis ☕ m...@macchiato.com wrote:

"I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well."

Academic Romanisation would be useful as well!

Richard.
Re: Character set cluelessness
E.g., in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm

Mark
https://plus.google.com/114199149796022210033

On Tue, Oct 2, 2012 at 1:49 PM, Mark Davis ☕ m...@macchiato.com wrote:

"I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well."
Re: Character set cluelessness
And just to be clear, I do agree that their documentation of the standards usage, well, needs improvement. I'm just talking about the actual data, and for that as a practical matter it is valuable to have both the native language version(s) of a name, and a Latin equivalent.

Mark
https://plus.google.com/114199149796022210033
Re: Character set cluelessness
On Tue, 02 Oct 2012 09:14:08 -0700 Doug Ewell d...@ewellic.org wrote:

"It's 2012. How does one get through to folks like this?"

Even people who should know better can get confused about character sets. Does anyone know what 'a complex script Unicode range' is? It's a term that occurs in the Office Open XML specification, but I can't find a definition for it.

It's just possible that it means a range where hypothetically unassigned characters would not be left-to-right, but I've a feeling it ought to include Vietnamese characters for all that they're Latin script. Possibly the definitions have not been provided because the concept ought to involve the tricky task of breaking text runs into script runs. (Lots of people feel one should be able to add script-specific combining marks to U+25CC DOTTED CIRCLE, U+2013 EN DASH and U+00D7 MULTIPLICATION SIGN or perhaps even U+0078 LATIN SMALL LETTER X. U+0964 DEVANAGARI DANDA is used with the Latin, Devanagari and Tamil scripts, to name but a few.)

Richard.
Re: Character set cluelessness
Richard,

"Complex script" usually refers to scripts where rendering isn't simply a matter of putting glyphs side by side. That includes stuff with combining marks, ligatures, reordering, stacking, and the like.

Regards, Martin.
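One ingredient of that informal description, the presence of combining marks, can be inspected with Python's standard `unicodedata` module. This is a rough sketch only: the helper names `mark_report` and `has_marks` are invented here, the check says nothing about ligatures, reordering or stacking, and it is not the definition used by the Office Open XML specification.

```python
import unicodedata


def mark_report(text: str) -> list[tuple[str, str, str]]:
    """List (code point, name, general category) per character after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.category(ch))
        for ch in decomposed
    ]


def has_marks(text: str) -> bool:
    """True if the decomposed text contains any mark (category Mn, Mc or Me)."""
    return any(category.startswith("M") for _, _, category in mark_report(text))


# Plain Latin, Vietnamese in Latin script, and a Devanagari syllable.
for sample in ("Viet", "Vi\u1ec7t", "\u0915\u093f"):
    print(sample, has_marks(sample))
    for row in mark_report(sample):
        print("   ", row)
```

On this test the Vietnamese word contains marks even though it is Latin script, which fits Richard's feeling that a purely script-based range would not capture what is meant.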
Re: Character set cluelessness
So in order to get something going here, why doesn't Doug draft a letter to these guys (possibly based on the one from a few years ago), and then Mark sends it off in his position at Unicode, which hopefully will impress them more than just a personal contribution?

Being upset on this list (which I am too, of course) doesn't change anything.

Regards, Martin.
The term complex script in OOXML spec (Re: [unicode] Re: Character set cluelessness)
Dear Richard,

There has been a long discussion about OOXML's "complex script" in JTC1/SC34 since 2009.

https://skydrive.live.com/view.aspx/Public%20Documents/2009/DR-09-0040.docx?cid=c8ba0861dc5e4adcsc=documents

I expect the next corrigendum or amendment will describe more about it. Unfortunately, what "complex script" refers to in the implementation may be different from what Unicode experts associate with the term.

Regards, mpsuzuki